Supervised Learning: Leveraging Ensemble Learning With Bagging, Boosting, Stacking and Blending Approaches¶


John Pauline Pineda

March 12, 2025


  • 1. Table of Contents
    • 1.1 Data Background
    • 1.2 Data Description
    • 1.3 Data Quality Assessment
    • 1.4 Data Preprocessing
      • 1.4.1 Data Splitting
      • 1.4.2 Data Profiling
      • 1.4.3 Category Aggregation and Encoding
      • 1.4.4 Outlier and Distributional Shape Analysis
      • 1.4.5 Collinearity
    • 1.5 Data Exploration
      • 1.5.1 Exploratory Data Analysis
      • 1.5.2 Hypothesis Testing
    • 1.6 Premodelling Data Preparation
      • 1.6.1 Preprocessed Data Description
      • 1.6.2 Preprocessing Pipeline Development
    • 1.7 Bagged Model Development
      • 1.7.1 Random Forest
      • 1.7.2 Extra Trees
      • 1.7.3 Bagged Decision Trees
      • 1.7.4 Bagged Logistic Regression
      • 1.7.5 Bagged Support Vector Machine
    • 1.8 Boosted Model Development
      • 1.8.1 AdaBoost
      • 1.8.2 Gradient Boosting
      • 1.8.3 XGBoost
      • 1.8.4 Light GBM
      • 1.8.5 CatBoost
    • 1.9 Stacked Model Development
      • 1.9.1 Base Learner - K-Nearest Neighbors
      • 1.9.2 Base Learner - Support Vector Machine
      • 1.9.3 Base Learner - Ridge Classifier
      • 1.9.4 Base Learner - Neural Network
      • 1.9.5 Base Learner - Decision Tree
      • 1.9.6 Meta Learner - Logistic Regression
    • 1.10 Blended Model Development
      • 1.10.1 Base Learner - K-Nearest Neighbors
      • 1.10.2 Base Learner - Support Vector Machine
      • 1.10.3 Base Learner - Ridge Classifier
      • 1.10.4 Base Learner - Neural Network
      • 1.10.5 Base Learner - Decision Tree
      • 1.10.6 Meta Learner - Logistic Regression
    • 1.11 Consolidated Findings
  • 2. Summary
  • 3. References

1. Table of Contents ¶

This project explores different Ensemble Learning approaches which combine the predictions from multiple models to achieve better predictive performance using various helpful packages in Python. The ensemble frameworks applied in the analysis were grouped into three classes: the Bagging Approach, which fits many individual learners on different samples of the same dataset and averages the predictions; the Boosting Approach, which adds ensemble members sequentially to correct the predictions made by prior models and outputs a weighted average of the predictions; and the Stacking or Blending Approach, which consolidates many different and diverse learners on the same data and uses another model to learn how to best combine the predictions. Bagged models applied were the Random Forest, Extra Trees, Bagged Decision Tree, Bagged Logistic Regression and Bagged Support Vector Machine algorithms. Boosting models included the AdaBoost, Stochastic Gradient Boosting, Extreme Gradient Boosting, Light Gradient Boosting Machine and CatBoost algorithms. Individual base learners including the K-Nearest Neighbors, Support Vector Machine, Ridge Classifier, Neural Network and Decision Tree algorithms were stacked or blended together as contributors to the Logistic Regression meta-model. The resulting predictions derived from all ensemble learning models were independently evaluated on a test set based on accuracy and F1 score metrics. All results were consolidated in a Summary presented at the end of the document.

Ensemble Learning is a machine learning technique that improves predictive accuracy by combining multiple models to leverage their collective strengths. Traditional machine learning models often struggle with either high bias, which leads to overly simplistic predictions, or high variance, which makes them too sensitive to fluctuations in the data. Ensemble learning addresses these challenges by aggregating the outputs of several models, creating a more robust and reliable predictor. In classification problems, this can be done through majority voting, weighted averaging, or more advanced meta-learning techniques. The key advantage of ensemble learning is its ability to reduce both bias and variance, leading to better generalization on unseen data. However, this comes at the cost of increased computational complexity and interpretability, as managing multiple models requires more resources and makes it harder to explain predictions.
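
To make the majority-voting mechanism concrete, the minimal sketch below fits a hard-voting ensemble on a synthetic dataset. The data and model choices here are illustrative assumptions only and are separate from the analysis conducted later in this notebook.

##################################
# Illustrative sketch only:
# hard-voting ensemble on synthetic data
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Each base model casts one vote; the majority class label wins
voting = VotingClassifier(estimators=[('lr', LogisticRegression(max_iter=1000)),
                                      ('knn', KNeighborsClassifier()),
                                      ('dt', DecisionTreeClassifier(random_state=42))],
                          voting='hard')
voting.fit(X_train, y_train)
print(f"Hard-voting ensemble accuracy: {voting.score(X_test, y_test):.3f}")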

Bagging (Bootstrap Aggregating) is an ensemble learning technique that reduces model variance by training multiple instances of the same algorithm on different randomly sampled subsets of the training data. The fundamental problem bagging aims to solve is overfitting, particularly in high-variance models. By generating multiple bootstrap samples (random subsets created through sampling with replacement), bagging ensures that each model is trained on slightly different data, making the overall prediction more stable. In classification problems, the final output is obtained by majority voting among the individual models, while in regression, their predictions are averaged. Bagging is particularly effective when dealing with noisy datasets, as it smooths out individual model errors. However, its effectiveness is limited for low-variance models, and the requirement to train multiple models increases computational cost.
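
As a concrete illustration of bootstrap aggregation, the minimal sketch below bags 50 decision trees on synthetic data; the dataset and parameter values are assumptions chosen for demonstration only.

##################################
# Illustrative sketch only:
# bagging decision trees on synthetic data
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Each of the 50 trees is fitted on a bootstrap sample
# (drawn with replacement); class predictions are
# combined by majority vote across the ensemble
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                 n_estimators=50,
                                 bootstrap=True,
                                 random_state=42)
print(f"Mean CV accuracy: {cross_val_score(bagged_trees, X, y, cv=5).mean():.3f}")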

Boosting is an ensemble learning method that builds a strong classifier by training models sequentially, where each new model focuses on correcting the mistakes of its predecessors. Boosting assigns higher weights to misclassified instances, ensuring that subsequent models pay more attention to these hard-to-classify cases. The motivation behind boosting is to reduce both bias and variance by iteratively refining weak learners — models that perform only slightly better than random guessing — until they collectively form a strong classifier. In classification tasks, predictions are refined by combining weighted outputs of multiple weak models, typically decision stumps or shallow trees. This makes boosting highly effective in uncovering complex patterns in data. However, the sequential nature of boosting makes it computationally expensive compared to bagging, and it is more prone to overfitting if the number of weak learners is too high.
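
The minimal sketch below illustrates this sequential reweighting using AdaBoost with decision stumps on synthetic data; the data and parameter values are assumptions for demonstration only.

##################################
# Illustrative sketch only:
# boosting decision stumps on synthetic data
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Decision stumps (max_depth=1) are added sequentially;
# each round upweights the samples misclassified so far,
# and the final prediction is a weighted vote of all stumps
boosted_stumps = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                    n_estimators=100,
                                    learning_rate=0.5,
                                    random_state=42)
print(f"Mean CV accuracy: {cross_val_score(boosted_stumps, X, y, cv=5).mean():.3f}")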

Stacking, or stacked generalization, is an advanced ensemble method that improves predictive performance by training a meta-model to learn the optimal way to combine multiple base models using their out-of-fold predictions. Unlike traditional ensemble techniques such as bagging and boosting, which aggregate predictions through simple rules like averaging or majority voting, stacking introduces a second-level model that intelligently learns how to integrate diverse base models. The process starts by training multiple classifiers on the training dataset. However, instead of directly using their predictions, stacking employs k-fold cross-validation to generate out-of-fold predictions. Specifically, each base model is trained on a subset of the training data while leaving out a validation fold, and predictions on that unseen fold are recorded. This process is repeated across all folds, ensuring that each instance in the training data receives predictions from models that never saw it during training. These out-of-fold predictions are then used as input features for a meta-model, which learns the best way to combine them into a final decision. The advantage of stacking is that it allows different models to complement each other, capturing diverse aspects of the data that a single model might miss. This often results in superior classification accuracy compared to individual models or simpler ensemble approaches. However, stacking is computationally expensive, requiring multiple training iterations for base models and the additional meta-model. It also demands careful tuning to prevent overfitting, as the meta-model’s complexity can introduce new sources of error. Despite these challenges, stacking remains a powerful technique in applications where maximizing predictive performance is a priority.
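
The minimal sketch below demonstrates the out-of-fold mechanism using scikit-learn's StackingClassifier on synthetic data; the specific base learners and parameter values are assumptions for demonstration only.

##################################
# Illustrative sketch only:
# stacking with out-of-fold predictions
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# cv=5 produces out-of-fold predictions from each base learner;
# the logistic regression meta-model is then trained on those
# predictions instead of the raw features
stacked = StackingClassifier(estimators=[('knn', KNeighborsClassifier()),
                                         ('dt', DecisionTreeClassifier(random_state=42))],
                             final_estimator=LogisticRegression(max_iter=1000),
                             cv=5)
print(f"Mean CV accuracy: {cross_val_score(stacked, X, y, cv=5).mean():.3f}")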

Blending is an ensemble technique that enhances classification accuracy by training a meta-model on a holdout validation set, rather than using out-of-fold predictions like stacking. This simplifies implementation while maintaining the benefits of combining multiple base models. The process of blending starts by training base models on the full training dataset. Instead of applying cross-validation to obtain out-of-fold predictions, blending reserves a small portion of the training data as a holdout set. The base models make predictions on this unseen holdout set, and these predictions are then used as input features for a meta-model, which learns how to optimally combine them into a final classification decision. Since the meta-model is trained on predictions from unseen data, it avoids the risk of overfitting that can sometimes occur when base models are evaluated on the same data they were trained on. Blending is motivated by its simplicity and ease of implementation compared to stacking, as it eliminates the need for repeated k-fold cross-validation to generate training data for the meta-model. However, one drawback is that the meta-model has access to fewer training examples, as a portion of the data is withheld for validation rather than being used for training. This can limit the generalization ability of the final model, especially if the holdout set is too small. Despite this limitation, blending remains a useful approach in applications where a quick and effective ensemble method is needed without the computational overhead of stacking.
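
Since scikit-learn has no dedicated blending estimator, the minimal sketch below implements the holdout-based procedure by hand on synthetic data; all names and parameter values are assumptions for demonstration only.

##################################
# Illustrative sketch only:
# blending with a holdout validation set
##################################
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Reserve 25% of the data as the holdout set for the meta-model
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Base learners are trained on the training portion only
base_models = [KNeighborsClassifier(), DecisionTreeClassifier(random_state=42)]
for model in base_models:
    model.fit(X_train, y_train)

# Base-model probabilities on the unseen holdout set
# become the input features of the meta-model
meta_features = np.column_stack([m.predict_proba(X_holdout)[:, 1] for m in base_models])
meta_model = LogisticRegression().fit(meta_features, y_holdout)

# At inference, new samples pass through the base models first,
# then through the meta-model:
# new_meta = np.column_stack([m.predict_proba(X_new)[:, 1] for m in base_models])
# predictions = meta_model.predict(new_meta)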

1.1. Data Background ¶

An open Thyroid Disease Dataset from Kaggle (with all credits attributed to Jai Naru and Abuchi Onwuegbusi) was used for the analysis, consolidated from the following primary sources:

  1. Reference Repository entitled Differentiated Thyroid Cancer Recurrence from UC Irvine Machine Learning Repository
  2. Research Paper entitled Machine Learning for Risk Stratification of Thyroid Cancer Patients: a 15-year Cohort Study from the European Archives of Oto-Rhino-Laryngology

This study hypothesized that various clinicopathological characteristics influence differentiated thyroid cancer recurrence among patients.

The dichotomous categorical variable for the study is:

  • Recurred - Status of the patient (Yes, Recurrence of differentiated thyroid cancer | No, No recurrence of differentiated thyroid cancer)

The predictor variables for the study are:

  • Age - Patient's age (Years)
  • Gender - Patient's sex (M | F)
  • Smoking - Indication of smoking (Yes | No)
  • Hx Smoking - Indication of smoking history (Yes | No)
  • Hx Radiotherapy - Indication of radiotherapy history for any condition (Yes | No)
  • Thyroid Function - Status of thyroid function (Euthyroid | Clinical Hyperthyroidism | Clinical Hypothyroidism | Subclinical Hyperthyroidism | Subclinical Hypothyroidism)
  • Physical Examination - Findings from physical examination including palpation of the thyroid gland and surrounding structures (Normal | Diffuse Goiter | Multinodular Goiter | Single Nodular Goiter-Left | Single Nodular Goiter-Right)
  • Adenopathy - Indication of enlarged lymph nodes in the neck region (No | Right | Extensive | Left | Bilateral | Posterior)
  • Pathology - Specific thyroid cancer type as determined by pathology examination of biopsy samples (Follicular | Hurthle Cell | Micropapillary | Papillary)
  • Focality - Indication if the cancer is limited to one location or present in multiple locations (Uni-Focal | Multi-Focal)
  • Risk - Risk category of the cancer based on various factors, such as tumor size, extent of spread, and histological type (Low | Intermediate | High)
  • T - Tumor classification based on its size and extent of invasion into nearby structures (T1a | T1b | T2 | T3a | T3b | T4a | T4b)
  • N - Nodal classification indicating the involvement of lymph nodes (N0 | N1a | N1b)
  • M - Metastasis classification indicating the presence or absence of distant metastases (M0 | M1)
  • Stage - Overall stage of the cancer, typically determined by combining T, N, and M classifications (I | II | III | IVA | IVB)
  • Response - Cancer's response to treatment (Biochemical Incomplete | Indeterminate | Excellent | Structural Incomplete)

1.2. Data Description ¶

  1. The initial tabular dataset comprised 383 observations and 17 variables (including 1 target and 16 predictors).
    • 383 rows (observations)
    • 17 columns (variables)
      • 1/17 target (categorical)
        • Recurred
      • 1/17 predictor (numeric)
        • Age
      • 15/17 predictor (categorical)
        • Gender
        • Smoking
        • Hx_Smoking
        • Hx_Radiotherapy
        • Thyroid_Function
        • Physical_Examination
        • Adenopathy
        • Pathology
        • Focality
        • Risk
        • T
        • N
        • M
        • Stage
        • Response
In [1]:
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
import itertools
import os
import pickle
%matplotlib inline

from operator import add, mul, truediv
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from scipy import stats
from scipy.stats import pointbiserialr, chi2_contingency

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold, KFold, cross_val_score
from sklearn.inspection import permutation_importance
In [2]:
##################################
# Defining file paths
##################################
DATASETS_ORIGINAL_PATH = r"datasets\original"
DATASETS_FINAL_PATH = r"datasets\final\complete"
DATASETS_FINAL_TRAIN_PATH = r"datasets\final\train"
DATASETS_FINAL_TRAIN_FEATURES_PATH = r"datasets\final\train\features"
DATASETS_FINAL_TRAIN_TARGET_PATH = r"datasets\final\train\target"
DATASETS_FINAL_VALIDATION_PATH = r"datasets\final\validation"
DATASETS_FINAL_VALIDATION_FEATURES_PATH = r"datasets\final\validation\features"
DATASETS_FINAL_VALIDATION_TARGET_PATH = r"datasets\final\validation\target"
DATASETS_FINAL_TEST_PATH = r"datasets\final\test"
DATASETS_FINAL_TEST_FEATURES_PATH = r"datasets\final\test\features"
DATASETS_FINAL_TEST_TARGET_PATH = r"datasets\final\test\target"
DATASETS_PREPROCESSED_PATH = r"datasets\preprocessed"
DATASETS_PREPROCESSED_TRAIN_PATH = r"datasets\preprocessed\train"
DATASETS_PREPROCESSED_TRAIN_FEATURES_PATH = r"datasets\preprocessed\train\features"
DATASETS_PREPROCESSED_TRAIN_TARGET_PATH = r"datasets\preprocessed\train\target"
DATASETS_PREPROCESSED_VALIDATION_PATH = r"datasets\preprocessed\validation"
DATASETS_PREPROCESSED_VALIDATION_FEATURES_PATH = r"datasets\preprocessed\validation\features"
DATASETS_PREPROCESSED_VALIDATION_TARGET_PATH = r"datasets\preprocessed\validation\target"
DATASETS_PREPROCESSED_TEST_PATH = r"datasets\preprocessed\test"
DATASETS_PREPROCESSED_TEST_FEATURES_PATH = r"datasets\preprocessed\test\features"
DATASETS_PREPROCESSED_TEST_TARGET_PATH = r"datasets\preprocessed\test\target"
MODELS_PATH = r"models"
In [3]:
##################################
# Loading the dataset
# from the DATASETS_ORIGINAL_PATH
##################################
thyroid_cancer = pd.read_csv(os.path.join("..", DATASETS_ORIGINAL_PATH, "Thyroid_Diff.csv"))
In [4]:
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(thyroid_cancer.shape)
Dataset Dimensions: 
(383, 17)
In [5]:
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(thyroid_cancer.dtypes)
Column Names and Data Types:
Age                      int64
Gender                  object
Smoking                 object
Hx Smoking              object
Hx Radiotherapy         object
Thyroid Function        object
Physical Examination    object
Adenopathy              object
Pathology               object
Focality                object
Risk                    object
T                       object
N                       object
M                       object
Stage                   object
Response                object
Recurred                object
dtype: object
In [6]:
##################################
# Renaming and standardizing the column names
# to replace blanks with underscores
##################################
thyroid_cancer.columns = thyroid_cancer.columns.str.replace(" ", "_")
In [7]:
##################################
# Taking a snapshot of the dataset
##################################
thyroid_cancer.head()
Out[7]:
Age Gender Smoking Hx_Smoking Hx_Radiotherapy Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N M Stage Response Recurred
0 27 F No No No Euthyroid Single nodular goiter-left No Micropapillary Uni-Focal Low T1a N0 M0 I Indeterminate No
1 34 F No Yes No Euthyroid Multinodular goiter No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
2 30 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
3 62 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
4 62 F No No No Euthyroid Multinodular goiter No Micropapillary Multi-Focal Low T1a N0 M0 I Excellent No
In [8]:
##################################
# Selecting categorical columns (both object and categorical types)
# and listing the unique categorical levels
##################################
cat_cols = thyroid_cancer.select_dtypes(include=["object", "category"]).columns
for col in cat_cols:
    print(f"Categorical | Object Column: {col}")
    print(thyroid_cancer[col].unique())  
    print("-" * 40)
    
Categorical | Object Column: Gender
['F' 'M']
----------------------------------------
Categorical | Object Column: Smoking
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Hx_Smoking
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Hx_Radiotherapy
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Thyroid_Function
['Euthyroid' 'Clinical Hyperthyroidism' 'Clinical Hypothyroidism'
 'Subclinical Hyperthyroidism' 'Subclinical Hypothyroidism']
----------------------------------------
Categorical | Object Column: Physical_Examination
['Single nodular goiter-left' 'Multinodular goiter'
 'Single nodular goiter-right' 'Normal' 'Diffuse goiter']
----------------------------------------
Categorical | Object Column: Adenopathy
['No' 'Right' 'Extensive' 'Left' 'Bilateral' 'Posterior']
----------------------------------------
Categorical | Object Column: Pathology
['Micropapillary' 'Papillary' 'Follicular' 'Hurthel cell']
----------------------------------------
Categorical | Object Column: Focality
['Uni-Focal' 'Multi-Focal']
----------------------------------------
Categorical | Object Column: Risk
['Low' 'Intermediate' 'High']
----------------------------------------
Categorical | Object Column: T
['T1a' 'T1b' 'T2' 'T3a' 'T3b' 'T4a' 'T4b']
----------------------------------------
Categorical | Object Column: N
['N0' 'N1b' 'N1a']
----------------------------------------
Categorical | Object Column: M
['M0' 'M1']
----------------------------------------
Categorical | Object Column: Stage
['I' 'II' 'IVB' 'III' 'IVA']
----------------------------------------
Categorical | Object Column: Response
['Indeterminate' 'Excellent' 'Structural Incomplete'
 'Biochemical Incomplete']
----------------------------------------
Categorical | Object Column: Recurred
['No' 'Yes']
----------------------------------------
In [9]:
##################################
# Correcting a category level
##################################
thyroid_cancer["Pathology"] = thyroid_cancer["Pathology"].replace("Hurthel cell", "Hurthle Cell")
In [10]:
##################################
# Setting the levels of the categorical variables
##################################
ordered_categories = {
    'Recurred': ['No', 'Yes'],
    'Gender': ['M', 'F'],
    'Smoking': ['No', 'Yes'],
    'Hx_Smoking': ['No', 'Yes'],
    'Hx_Radiotherapy': ['No', 'Yes'],
    'Thyroid_Function': ['Euthyroid', 'Subclinical Hypothyroidism', 'Subclinical Hyperthyroidism', 'Clinical Hypothyroidism', 'Clinical Hyperthyroidism'],
    'Physical_Examination': ['Normal', 'Single nodular goiter-left', 'Single nodular goiter-right', 'Multinodular goiter', 'Diffuse goiter'],
    'Adenopathy': ['No', 'Left', 'Right', 'Bilateral', 'Posterior', 'Extensive'],
    'Pathology': ['Hurthle Cell', 'Follicular', 'Micropapillary', 'Papillary'],
    'Focality': ['Uni-Focal', 'Multi-Focal'],
    'Risk': ['Low', 'Intermediate', 'High'],
    'T': ['T1a', 'T1b', 'T2', 'T3a', 'T3b', 'T4a', 'T4b'],
    'N': ['N0', 'N1a', 'N1b'],
    'M': ['M0', 'M1'],
    'Stage': ['I', 'II', 'III', 'IVA', 'IVB'],
    'Response': ['Excellent', 'Structural Incomplete', 'Biochemical Incomplete', 'Indeterminate'],
}
for col, levels in ordered_categories.items():
    thyroid_cancer[col] = thyroid_cancer[col].astype('category').cat.set_categories(levels, ordered=True)
In [11]:
##################################
# Performing a general exploration of the numeric variables
##################################
print('Numeric Variable Summary:')
display(thyroid_cancer.describe(include='number').transpose())
Numeric Variable Summary:
count mean std min 25% 50% 75% max
Age 383.0 40.866841 15.134494 15.0 29.0 37.0 51.0 82.0
In [12]:
##################################
# Performing a general exploration of the categorical variables
##################################
print('Categorical Variable Summary:')
display(thyroid_cancer.describe(include='category').transpose())
Categorical Variable Summary:
count unique top freq
Gender 383 2 F 312
Smoking 383 2 No 334
Hx_Smoking 383 2 No 355
Hx_Radiotherapy 383 2 No 376
Thyroid_Function 383 5 Euthyroid 332
Physical_Examination 383 5 Single nodular goiter-right 140
Adenopathy 383 6 No 277
Pathology 383 4 Papillary 287
Focality 383 2 Uni-Focal 247
Risk 383 3 Low 249
T 383 7 T2 151
N 383 3 N0 268
M 383 2 M0 365
Stage 383 5 I 333
Response 383 4 Excellent 208
Recurred 383 2 No 275
In [13]:
##################################
# Performing a general exploration of the categorical variable levels
# based on the ordered categories
##################################
ordered_cat_cols = thyroid_cancer.select_dtypes(include=["category"]).columns
for col in ordered_cat_cols:
    print(f"Column: {col}")
    print("Absolute Frequencies:")
    print(thyroid_cancer[col].value_counts().reindex(thyroid_cancer[col].cat.categories))
    print("\nNormalized Frequencies:")
    print(thyroid_cancer[col].value_counts(normalize=True).reindex(thyroid_cancer[col].cat.categories))
    print("-" * 50)
   
Column: Gender
Absolute Frequencies:
M     71
F    312
Name: count, dtype: int64

Normalized Frequencies:
M    0.185379
F    0.814621
Name: proportion, dtype: float64
--------------------------------------------------
Column: Smoking
Absolute Frequencies:
No     334
Yes     49
Name: count, dtype: int64

Normalized Frequencies:
No     0.872063
Yes    0.127937
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Smoking
Absolute Frequencies:
No     355
Yes     28
Name: count, dtype: int64

Normalized Frequencies:
No     0.926893
Yes    0.073107
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Radiotherapy
Absolute Frequencies:
No     376
Yes      7
Name: count, dtype: int64

Normalized Frequencies:
No     0.981723
Yes    0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Thyroid_Function
Absolute Frequencies:
Euthyroid                      332
Subclinical Hypothyroidism      14
Subclinical Hyperthyroidism      5
Clinical Hypothyroidism         12
Clinical Hyperthyroidism        20
Name: count, dtype: int64

Normalized Frequencies:
Euthyroid                      0.866841
Subclinical Hypothyroidism     0.036554
Subclinical Hyperthyroidism    0.013055
Clinical Hypothyroidism        0.031332
Clinical Hyperthyroidism       0.052219
Name: proportion, dtype: float64
--------------------------------------------------
Column: Physical_Examination
Absolute Frequencies:
Normal                           7
Single nodular goiter-left      89
Single nodular goiter-right    140
Multinodular goiter            140
Diffuse goiter                   7
Name: count, dtype: int64

Normalized Frequencies:
Normal                         0.018277
Single nodular goiter-left     0.232376
Single nodular goiter-right    0.365535
Multinodular goiter            0.365535
Diffuse goiter                 0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Adenopathy
Absolute Frequencies:
No           277
Left          17
Right         48
Bilateral     32
Posterior      2
Extensive      7
Name: count, dtype: int64

Normalized Frequencies:
No           0.723238
Left         0.044386
Right        0.125326
Bilateral    0.083551
Posterior    0.005222
Extensive    0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Pathology
Absolute Frequencies:
Hurthle Cell       20
Follicular         28
Micropapillary     48
Papillary         287
Name: count, dtype: int64

Normalized Frequencies:
Hurthle Cell      0.052219
Follicular        0.073107
Micropapillary    0.125326
Papillary         0.749347
Name: proportion, dtype: float64
--------------------------------------------------
Column: Focality
Absolute Frequencies:
Uni-Focal      247
Multi-Focal    136
Name: count, dtype: int64

Normalized Frequencies:
Uni-Focal      0.644909
Multi-Focal    0.355091
Name: proportion, dtype: float64
--------------------------------------------------
Column: Risk
Absolute Frequencies:
Low             249
Intermediate    102
High             32
Name: count, dtype: int64

Normalized Frequencies:
Low             0.650131
Intermediate    0.266319
High            0.083551
Name: proportion, dtype: float64
--------------------------------------------------
Column: T
Absolute Frequencies:
T1a     49
T1b     43
T2     151
T3a     96
T3b     16
T4a     20
T4b      8
Name: count, dtype: int64

Normalized Frequencies:
T1a    0.127937
T1b    0.112272
T2     0.394256
T3a    0.250653
T3b    0.041775
T4a    0.052219
T4b    0.020888
Name: proportion, dtype: float64
--------------------------------------------------
Column: N
Absolute Frequencies:
N0     268
N1a     22
N1b     93
Name: count, dtype: int64

Normalized Frequencies:
N0     0.699739
N1a    0.057441
N1b    0.242820
Name: proportion, dtype: float64
--------------------------------------------------
Column: M
Absolute Frequencies:
M0    365
M1     18
Name: count, dtype: int64

Normalized Frequencies:
M0    0.953003
M1    0.046997
Name: proportion, dtype: float64
--------------------------------------------------
Column: Stage
Absolute Frequencies:
I      333
II      32
III      4
IVA      3
IVB     11
Name: count, dtype: int64

Normalized Frequencies:
I      0.869452
II     0.083551
III    0.010444
IVA    0.007833
IVB    0.028721
Name: proportion, dtype: float64
--------------------------------------------------
Column: Response
Absolute Frequencies:
Excellent                 208
Structural Incomplete      91
Biochemical Incomplete     23
Indeterminate              61
Name: count, dtype: int64

Normalized Frequencies:
Excellent                 0.543081
Structural Incomplete     0.237598
Biochemical Incomplete    0.060052
Indeterminate             0.159269
Name: proportion, dtype: float64
--------------------------------------------------
Column: Recurred
Absolute Frequencies:
No     275
Yes    108
Name: count, dtype: int64

Normalized Frequencies:
No     0.718016
Yes    0.281984
Name: proportion, dtype: float64
--------------------------------------------------

1.3. Data Quality Assessment ¶

Data quality findings based on assessment are as follows:

  1. A total of 19 duplicated rows were identified.
    • In total, 35 observations were affected, consisting of 16 unique occurrences and 19 subsequent duplicates.
    • These 19 duplicates spanned 16 distinct variations, meaning some variations had multiple duplicates.
    • To clean the dataset, all 19 duplicate rows were removed, retaining only the first occurrence of each of the 16 unique variations.
  2. No missing data noted: no variable had Null.Count>0 or Fill.Rate<1.0.
  3. Low variance observed for 8 variables with First.Second.Mode.Ratio>5.
    • Hx_Radiotherapy: First.Second.Mode.Ratio = 51.000 (comprising 2 category levels)
    • M: First.Second.Mode.Ratio = 19.222 (comprising 2 category levels)
    • Thyroid_Function: First.Second.Mode.Ratio = 15.650 (comprising 5 category levels)
    • Hx_Smoking: First.Second.Mode.Ratio = 12.000 (comprising 2 category levels)
    • Stage: First.Second.Mode.Ratio = 9.812 (comprising 5 category levels)
    • Smoking: First.Second.Mode.Ratio = 6.428 (comprising 2 category levels)
    • Pathology: First.Second.Mode.Ratio = 6.022 (comprising 4 category levels)
    • Adenopathy: First.Second.Mode.Ratio = 5.375 (comprising 6 category levels)
  4. No high-cardinality variables noted: no variable had Unique.Count.Ratio>10.
  5. No high skewness noted: no variable had Skewness>3 or Skewness<(-3).
In [14]:
##################################
# Counting the number of duplicated rows
##################################
thyroid_cancer.duplicated().sum()
Out[14]:
19
In [15]:
##################################
# Exploring the duplicated rows
##################################
duplicated_rows = thyroid_cancer[thyroid_cancer.duplicated(keep=False)]
display(duplicated_rows)
Age Gender Smoking Hx_Smoking Hx_Radiotherapy Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N M Stage Response Recurred
8 51 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
9 40 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
22 36 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
32 36 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
38 40 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
40 51 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
61 35 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
66 35 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
67 51 F No No No Euthyroid Single nodular goiter-left No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
69 51 F No No No Euthyroid Single nodular goiter-left No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
73 29 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
77 29 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
106 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
110 31 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
113 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
115 37 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
119 28 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
120 37 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
121 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
123 28 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
132 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
136 21 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
137 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
138 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
142 42 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
161 22 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
166 31 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
168 21 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
170 38 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
175 34 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
178 38 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
183 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
187 34 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
189 42 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
196 22 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
In [16]:
##################################
# Checking if duplicated rows have identical values across all columns
##################################
num_unique_dup_rows = duplicated_rows.drop_duplicates().shape[0]
num_total_dup_rows = duplicated_rows.shape[0]
if num_unique_dup_rows == 1:
    print("All duplicated rows have the same values across all columns.")
else:
    print(f"There are {num_unique_dup_rows} unique versions among the {num_total_dup_rows} duplicated rows.")
    
There are 16 unique versions among the 35 duplicated rows.
In [17]:
##################################
# Counting the unique variations among duplicated rows
##################################
unique_dup_variations = duplicated_rows.drop_duplicates()
variation_counts = duplicated_rows.value_counts().reset_index(name="Count")
print("Unique duplicated row variations and their counts:")
display(variation_counts)
Unique duplicated row variations and their counts:
Age Gender Smoking Hx_Smoking Hx_Radiotherapy Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N M Stage Response Recurred Count
0 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 4
1 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 3
2 21 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
3 22 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
4 28 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
5 29 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No 2
6 31 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
7 34 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
8 35 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No 2
9 36 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No 2
10 37 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
11 38 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
12 40 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No 2
13 42 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
14 51 F No No No Euthyroid Single nodular goiter-left No Papillary Uni-Focal Low T1b N0 M0 I Excellent No 2
15 51 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No 2
In [18]:
##################################
# Removing the duplicated rows and
# retaining only the first occurrence
##################################
thyroid_cancer_row_filtered = thyroid_cancer.drop_duplicates(keep="first")
print('Dataset Dimensions: ')
display(thyroid_cancer_row_filtered.shape)
Dataset Dimensions: 
(364, 17)
In [19]:
##################################
# Gathering the data types for each column
##################################
data_type_list = list(thyroid_cancer_row_filtered.dtypes)
In [20]:
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(thyroid_cancer_row_filtered.columns)
In [21]:
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(thyroid_cancer_row_filtered)] * len(thyroid_cancer_row_filtered.columns))
In [22]:
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(thyroid_cancer_row_filtered.isna().sum(axis=0))
In [23]:
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(thyroid_cancer_row_filtered.count())
In [24]:
##################################
# Gathering the fill rate (non-missing proportion) for each column
##################################
fill_rate_list = map(truediv, non_null_count_list, row_count_list)
In [25]:
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
                                              data_type_list,
                                              row_count_list,
                                              non_null_count_list,
                                              null_count_list,
                                              fill_rate_list), 
                                        columns=['Column.Name',
                                                 'Column.Type',
                                                 'Row.Count',
                                                 'Non.Null.Count',
                                                 'Null.Count',                                                 
                                                 'Fill.Rate'])
display(all_column_quality_summary)
Column.Name Column.Type Row.Count Non.Null.Count Null.Count Fill.Rate
0 Age int64 364 364 0 1.0
1 Gender category 364 364 0 1.0
2 Smoking category 364 364 0 1.0
3 Hx_Smoking category 364 364 0 1.0
4 Hx_Radiotherapy category 364 364 0 1.0
5 Thyroid_Function category 364 364 0 1.0
6 Physical_Examination category 364 364 0 1.0
7 Adenopathy category 364 364 0 1.0
8 Pathology category 364 364 0 1.0
9 Focality category 364 364 0 1.0
10 Risk category 364 364 0 1.0
11 T category 364 364 0 1.0
12 N category 364 364 0 1.0
13 M category 364 364 0 1.0
14 Stage category 364 364 0 1.0
15 Response category 364 364 0 1.0
16 Recurred category 364 364 0 1.0
In [26]:
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])
Out[26]:
0
In [27]:
##################################
# Identifying the columns
# with Fill.Rate < 0.90
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<0.90)]
In [28]:
##################################
# Gathering the indices for each observation
##################################
row_index_list = thyroid_cancer_row_filtered.index
In [29]:
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(thyroid_cancer_row_filtered.columns)] * len(thyroid_cancer_row_filtered))
In [30]:
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(thyroid_cancer_row_filtered.isna().sum(axis=1))
In [31]:
##################################
# Gathering the missing data percentage for each row
##################################
missing_rate_list = map(truediv, null_row_list, column_count_list)
In [32]:
##################################
# Identifying the rows
# with missing data
##################################
all_row_quality_summary = pd.DataFrame(zip(row_index_list,
                                           column_count_list,
                                           null_row_list,
                                           missing_rate_list), 
                                        columns=['Row.Name',
                                                 'Column.Count',
                                                 'Null.Count',                                                 
                                                 'Missing.Rate'])
display(all_row_quality_summary)
Row.Name Column.Count Null.Count Missing.Rate
0 0 17 0 0.0
1 1 17 0 0.0
2 2 17 0 0.0
3 3 17 0 0.0
4 4 17 0 0.0
... ... ... ... ...
359 378 17 0 0.0
360 379 17 0 0.0
361 380 17 0 0.0
362 381 17 0 0.0
363 382 17 0 0.0

364 rows × 4 columns

In [33]:
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])
Out[33]:
0
In [34]:
##################################
# Formulating the dataset
# with numeric columns only
##################################
thyroid_cancer_numeric = thyroid_cancer_row_filtered.select_dtypes(include='number')
In [35]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = thyroid_cancer_numeric.columns
In [36]:
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = thyroid_cancer_numeric.min()
In [37]:
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = thyroid_cancer_numeric.mean()
In [38]:
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = thyroid_cancer_numeric.median()
In [39]:
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = thyroid_cancer_numeric.max()
In [40]:
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0] for x in thyroid_cancer_numeric]
In [41]:
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1] for x in thyroid_cancer_numeric]
In [42]:
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [thyroid_cancer_numeric[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in thyroid_cancer_numeric]
In [43]:
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [thyroid_cancer_numeric[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in thyroid_cancer_numeric]
In [44]:
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
In [45]:
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = thyroid_cancer_numeric.nunique(dropna=True)
In [46]:
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(thyroid_cancer_numeric)] * len(thyroid_cancer_numeric.columns))
In [47]:
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
In [48]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = thyroid_cancer_numeric.skew()
In [49]:
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = thyroid_cancer_numeric.kurtosis()
In [50]:
##################################
# Generating a column quality summary for the numeric column
##################################
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
                                                numeric_minimum_list,
                                                numeric_mean_list,
                                                numeric_median_list,
                                                numeric_maximum_list,
                                                numeric_first_mode_list,
                                                numeric_second_mode_list,
                                                numeric_first_mode_count_list,
                                                numeric_second_mode_count_list,
                                                numeric_first_second_mode_ratio_list,
                                                numeric_unique_count_list,
                                                numeric_row_count_list,
                                                numeric_unique_count_ratio_list,
                                                numeric_skewness_list,
                                                numeric_kurtosis_list), 
                                        columns=['Numeric.Column.Name',
                                                 'Minimum',
                                                 'Mean',
                                                 'Median',
                                                 'Maximum',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio',
                                                 'Skewness',
                                                 'Kurtosis'])
display(numeric_column_quality_summary)
Numeric.Column.Name Minimum Mean Median Maximum First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio Skewness Kurtosis
0 Age 15 41.25 38.0 82 31 27 21 13 1.615385 65 364 0.178571 0.678269 -0.359255
In [51]:
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[51]:
0
In [52]:
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
Out[52]:
0
In [53]:
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])
Out[53]:
0
In [54]:
##################################
# Formulating the dataset
# with categorical columns only
##################################
thyroid_cancer_categorical = thyroid_cancer_row_filtered.select_dtypes(include='category')
In [55]:
##################################
# Gathering the variable names for the categorical column
##################################
categorical_variable_name_list = thyroid_cancer_categorical.columns
In [56]:
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [thyroid_cancer_row_filtered[x].value_counts().index.tolist()[0] for x in thyroid_cancer_categorical]
In [57]:
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [thyroid_cancer_row_filtered[x].value_counts().index.tolist()[1] for x in thyroid_cancer_categorical]
In [58]:
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [thyroid_cancer_categorical[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in thyroid_cancer_categorical]
In [59]:
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [thyroid_cancer_categorical[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in thyroid_cancer_categorical]
In [60]:
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
In [61]:
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = thyroid_cancer_categorical.nunique(dropna=True)
In [62]:
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(thyroid_cancer_categorical)] * len(thyroid_cancer_categorical.columns))
In [63]:
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
In [64]:
##################################
# Generating a column quality summary for the categorical columns
##################################
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
                                                    categorical_first_mode_list,
                                                    categorical_second_mode_list,
                                                    categorical_first_mode_count_list,
                                                    categorical_second_mode_count_list,
                                                    categorical_first_second_mode_ratio_list,
                                                    categorical_unique_count_list,
                                                    categorical_row_count_list,
                                                    categorical_unique_count_ratio_list), 
                                        columns=['Categorical.Column.Name',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio'])
display(categorical_column_quality_summary)
Categorical.Column.Name First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio
0 Gender F M 293 71 4.126761 2 364 0.005495
1 Smoking No Yes 315 49 6.428571 2 364 0.005495
2 Hx_Smoking No Yes 336 28 12.000000 2 364 0.005495
3 Hx_Radiotherapy No Yes 357 7 51.000000 2 364 0.005495
4 Thyroid_Function Euthyroid Clinical Hyperthyroidism 313 20 15.650000 5 364 0.013736
5 Physical_Examination Multinodular goiter Single nodular goiter-right 135 127 1.062992 5 364 0.013736
6 Adenopathy No Right 258 48 5.375000 6 364 0.016484
7 Pathology Papillary Micropapillary 271 45 6.022222 4 364 0.010989
8 Focality Uni-Focal Multi-Focal 228 136 1.676471 2 364 0.005495
9 Risk Low Intermediate 230 102 2.254902 3 364 0.008242
10 T T2 T3a 138 96 1.437500 7 364 0.019231
11 N N0 N1b 249 93 2.677419 3 364 0.008242
12 M M0 M1 346 18 19.222222 2 364 0.005495
13 Stage I II 314 32 9.812500 5 364 0.013736
14 Response Excellent Structural Incomplete 189 91 2.076923 4 364 0.010989
15 Recurred No Yes 256 108 2.370370 2 364 0.005495
In [65]:
##################################
# Counting the number of categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[65]:
8
In [66]:
##################################
# Identifying the categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
display(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)].sort_values(by=['First.Second.Mode.Ratio'], ascending=False))
Categorical.Column.Name First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio
3 Hx_Radiotherapy No Yes 357 7 51.000000 2 364 0.005495
12 M M0 M1 346 18 19.222222 2 364 0.005495
4 Thyroid_Function Euthyroid Clinical Hyperthyroidism 313 20 15.650000 5 364 0.013736
2 Hx_Smoking No Yes 336 28 12.000000 2 364 0.005495
13 Stage I II 314 32 9.812500 5 364 0.013736
1 Smoking No Yes 315 49 6.428571 2 364 0.005495
7 Pathology Papillary Micropapillary 271 45 6.022222 4 364 0.010989
6 Adenopathy No Right 258 48 5.375000 6 364 0.016484
In [67]:
##################################
# Counting the number of categorical columns
# with Unique.Count.Ratio > 10.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
Out[67]:
0

1.4. Data Preprocessing ¶

1.4.1 Data Splitting ¶

  1. The baseline dataset (with duplicate rows removed from the original dataset) is comprised of:
    • 364 rows (observations)
      • 256 Recurred=No: 70.33%
      • 108 Recurred=Yes: 29.67%
    • 17 columns (variables)
      • 1/17 target (categorical)
        • Recurred
      • 1/17 predictor (numeric)
        • Age
      • 15/17 predictor (categorical)
        • Gender
        • Smoking
        • Hx_Smoking
        • Hx_Radiotherapy
        • Thyroid_Function
        • Physical_Examination
        • Adenopathy
        • Pathology
        • Focality
        • Risk
        • T
        • N
        • M
        • Stage
        • Response
  2. The baseline dataset was divided into three subsets using a fixed random seed (the effective subset sizes are verified in the sketch after this list):
    • test data: 25% of the original data with class stratification applied
    • train data (initial): 75% of the original data with class stratification applied
      • train data (final): 75% of the train (initial) data with class stratification applied
      • validation data: 25% of the train (initial) data with class stratification applied
  3. Models were developed from the train data (final). Using the same dataset, a subset of models with optimal hyperparameters was selected based on cross-validation.
  4. Among candidate models with optimal hyperparameters, the final model was selected based on performance on the validation data.
  5. Performance of the selected final model (and other candidate models for post-model selection comparison) was evaluated using the test data.
  6. The train data (final) subset is comprised of:
    • 204 rows (observations)
      • 143 Recurred=No: 70.10%
      • 61 Recurred=Yes: 29.90%
    • 17 columns (variables)
  7. The validation data subset is comprised of:
    • 69 rows (observations)
      • 49 Recurred=No: 71.01%
      • 20 Recurred=Yes: 28.99%
    • 17 columns (variables)
  8. The test data subset is comprised of:
    • 91 rows (observations)
      • 64 Recurred=No: 70.33%
      • 27 Recurred=Yes: 29.67%
    • 17 columns (variables)
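As a sanity check on the nested 75-25 splits described above, the short sketch below reproduces the expected subset sizes. This is an illustrative calculation only, not part of the modeling pipeline; it relies on scikit-learn's behavior of rounding the test fraction up when test_size is given as a float, which matches the 91 and 69 counts reported below.
##################################
# Verifying the expected subset sizes
# implied by the nested 75-25 splits
# (illustrative sketch only)
##################################
import math

baseline_rows = 364
test_rows = math.ceil(baseline_rows * 0.25)              # 91
train_initial_rows = baseline_rows - test_rows           # 273
validation_rows = math.ceil(train_initial_rows * 0.25)   # 69
train_final_rows = train_initial_rows - validation_rows  # 204
print(train_final_rows, validation_rows, test_rows)      # 204 69 91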
In [68]:
##################################
# Creating a dataset copy
# of the row filtered data
##################################
thyroid_cancer_baseline = thyroid_cancer_row_filtered.copy()
In [69]:
##################################
# Performing a general exploration
# of the baseline dataset
##################################
print('Final Dataset Dimensions: ')
display(thyroid_cancer_baseline.shape)
Final Dataset Dimensions: 
(364, 17)
In [70]:
print('Target Variable Breakdown: ')
thyroid_cancer_breakdown = thyroid_cancer_baseline.groupby('Recurred', observed=True).size().reset_index(name='Count')
thyroid_cancer_breakdown['Percentage'] = (thyroid_cancer_breakdown['Count'] / len(thyroid_cancer_baseline)) * 100
display(thyroid_cancer_breakdown)
Target Variable Breakdown: 
Recurred Count Percentage
0 No 256 70.32967
1 Yes 108 29.67033
In [71]:
##################################
# Formulating the train and test data
# from the final dataset
# by applying stratification and
# using a 75-25 ratio
##################################
thyroid_cancer_train_initial, thyroid_cancer_test = train_test_split(thyroid_cancer_baseline, 
                                                               test_size=0.25, 
                                                               stratify=thyroid_cancer_baseline['Recurred'], 
                                                               random_state=987654321)
In [72]:
##################################
# Performing a general exploration
# of the initial training dataset
##################################
X_train_initial = thyroid_cancer_train_initial.drop('Recurred', axis = 1)
y_train_initial = thyroid_cancer_train_initial['Recurred']
print('Initial Train Dataset Dimensions: ')
display(X_train_initial.shape)
display(y_train_initial.shape)
print('Initial Train Target Variable Breakdown: ')
display(y_train_initial.value_counts())
print('Initial Train Target Variable Proportion: ')
display(y_train_initial.value_counts(normalize = True))
Initial Train Dataset Dimensions: 
(273, 16)
(273,)
Initial Train Target Variable Breakdown: 
Recurred
No     192
Yes     81
Name: count, dtype: int64
Initial Train Target Variable Proportion: 
Recurred
No     0.703297
Yes    0.296703
Name: proportion, dtype: float64
In [73]:
##################################
# Performing a general exploration
# of the test dataset
##################################
X_test = thyroid_cancer_test.drop('Recurred', axis = 1)
y_test = thyroid_cancer_test['Recurred']
print('Test Dataset Dimensions: ')
display(X_test.shape)
display(y_test.shape)
print('Test Target Variable Breakdown: ')
display(y_test.value_counts())
print('Test Target Variable Proportion: ')
display(y_test.value_counts(normalize = True))
Test Dataset Dimensions: 
(91, 16)
(91,)
Test Target Variable Breakdown: 
Recurred
No     64
Yes    27
Name: count, dtype: int64
Test Target Variable Proportion: 
Recurred
No     0.703297
Yes    0.296703
Name: proportion, dtype: float64
In [74]:
##################################
# Formulating the train and validation data
# from the train dataset
# by applying stratification and
# using a 75-25 ratio
##################################
thyroid_cancer_train, thyroid_cancer_validation = train_test_split(thyroid_cancer_train_initial, 
                                                             test_size=0.25, 
                                                             stratify=thyroid_cancer_train_initial['Recurred'], 
                                                             random_state=987654321)
In [75]:
##################################
# Performing a general exploration
# of the final training dataset
##################################
X_train = thyroid_cancer_train.drop('Recurred', axis = 1)
y_train = thyroid_cancer_train['Recurred']
print('Final Train Dataset Dimensions: ')
display(X_train.shape)
display(y_train.shape)
print('Final Train Target Variable Breakdown: ')
display(y_train.value_counts())
print('Final Train Target Variable Proportion: ')
display(y_train.value_counts(normalize = True))
Final Train Dataset Dimensions: 
(204, 16)
(204,)
Final Train Target Variable Breakdown: 
Recurred
No     143
Yes     61
Name: count, dtype: int64
Final Train Target Variable Proportion: 
Recurred
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
In [76]:
##################################
# Performing a general exploration
# of the validation dataset
##################################
X_validation = thyroid_cancer_validation.drop('Recurred', axis = 1)
y_validation = thyroid_cancer_validation['Recurred']
print('Validation Dataset Dimensions: ')
display(X_validation.shape)
display(y_validation.shape)
print('Validation Target Variable Breakdown: ')
display(y_validation.value_counts())
print('Validation Target Variable Proportion: ')
display(y_validation.value_counts(normalize = True))
Validation Dataset Dimensions: 
(69, 16)
(69,)
Validation Target Variable Breakdown: 
Recurred
No     49
Yes    20
Name: count, dtype: int64
Validation Target Variable Proportion: 
Recurred
No     0.710145
Yes    0.289855
Name: proportion, dtype: float64
In [77]:
##################################
# Saving the training data
# to the DATASETS_FINAL_TRAIN_PATH
# and DATASETS_FINAL_TRAIN_FEATURES_PATH
# and DATASETS_FINAL_TRAIN_TARGET_PATH
##################################
thyroid_cancer_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_PATH, "thyroid_cancer_train.csv"), index=False)
X_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_FEATURES_PATH, "X_train.csv"), index=False)
y_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_TARGET_PATH, "y_train.csv"), index=False)
In [78]:
##################################
# Saving the validation data
# to the DATASETS_FINAL_VALIDATION_PATH
# and DATASETS_FINAL_VALIDATION_FEATURE_PATH
# and DATASETS_FINAL_VALIDATION_TARGET_PATH
##################################
thyroid_cancer_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_PATH, "thyroid_cancer_validation.csv"), index=False)
X_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_FEATURES_PATH, "X_validation.csv"), index=False)
y_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_TARGET_PATH, "y_validation.csv"), index=False)
In [79]:
##################################
# Saving the test data
# to the DATASETS_FINAL_TEST_PATH
# and DATASETS_FINAL_TEST_FEATURES_PATH
# and DATASETS_FINAL_TEST_TARGET_PATH
##################################
thyroid_cancer_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_PATH, "thyroid_cancer_test.csv"), index=False)
X_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_FEATURES_PATH, "X_test.csv"), index=False)
y_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_TARGET_PATH, "y_test.csv"), index=False)

1.4.2 Data Profiling ¶

  1. No significant distributional anomalies were observed for the numeric predictor Age.
  2. 9 categorical predictors were observed with sparse categories containing too few cases per level, which risks poor generalization and cross-validation issues:
    • Thyroid_Function:
      • 171 Thyroid_Function=Euthyroid: 83.82%
      • 10 Thyroid_Function=Subclinical Hypothyroidism: 4.90%
      • 3 Thyroid_Function=Subclinical Hyperthyroidism: 1.47%
      • 7 Thyroid_Function=Clinical Hypothyroidism: 3.43%
      • 13 Thyroid_Function=Clinical Hyperthyroidism: 6.37%
    • Physical_Examination:
      • 4 Physical_Examination=Normal: 1.96%
      • 50 Physical_Examination=Single nodular goiter-left: 24.51%
      • 68 Physical_Examination=Single nodular goiter-right: 33.33%
      • 79 Physical_Examination=Multinodular goiter: 38.73%
      • 3 Physical_Examination=Diffuse goiter: 1.47%
    • Adenopathy:
      • 144 Adenopathy=No: 70.59%
      • 14 Adenopathy=Left: 6.86%
      • 21 Adenopathy=Right: 10.29%
      • 19 Adenopathy=Bilateral: 9.31%
      • 2 Adenopathy=Posterior: 0.98%
      • 4 Adenopathy=Extensive: 1.96%
    • Pathology:
      • 15 Pathology=Hurthle Cell: 7.35%
      • 14 Pathology=Follicular: 6.86%
      • 26 Pathology=Micropapillary: 12.75%
      • 149 Pathology=Papillary: 73.04%
    • Risk:
      • 127 Risk=Low: 62.25%
      • 60 Risk=Intermediate: 29.41%
      • 17 Risk=High: 8.33%
    • T:
      • 26 T=T1a: 12.75%
      • 21 T=T1b: 10.29%
      • 73 T=T2: 35.78%
      • 58 T=T3a: 28.43%
      • 10 T=T3b: 4.90%
      • 12 T=T4a: 5.88%
      • 4 T=T4b: 1.96%
    • N:
      • 139 N=N0: 68.14%
      • 11 N=N1a: 5.39%
      • 54 N=N1b: 26.47%
    • Stage:
      • 174 Stage=I: 85.29%
      • 21 Stage=II: 10.29%
      • 2 Stage=III: 0.98%
      • 2 Stage=IVA: 0.98%
      • 5 Stage=IVB: 2.45%
    • Response:
      • 109 Response=Excellent: 53.43%
      • 53 Response=Structural Incomplete: 25.98%
      • 8 Response=Biochemical Incomplete: 3.92%
      • 34 Response=Indeterminate: 16.67%
  3. 3 categorical predictors were excluded from the dataset after having been observed with extremely low variance, with categories showing very few or almost no variations across observations, which may limit predictive power or drive increased model complexity without performance gains:
    • Hx_Smoking:
      • 193 Hx_Smoking=No: 94.61%
      • 11 Hx_Smoking=Yes: 5.39%
    • Hx_Radiotherapy:
      • 202 Hx_Radiotherapy=No: 99.02%
      • 2 Hx_Radiotherapy=Yes: 0.98%
    • M:
      • 194 M=M0: 95.10%
      • 10 M=M1: 4.90%
In [80]:
##################################
# Segregating the target
# and predictor variables
##################################
thyroid_cancer_train_predictors = thyroid_cancer_train.iloc[:,:-1].columns
thyroid_cancer_train_predictors_numeric = thyroid_cancer_train.iloc[:,:-1].loc[:, thyroid_cancer_train.iloc[:,:-1].columns == 'Age'].columns
thyroid_cancer_train_predictors_categorical = thyroid_cancer_train.iloc[:,:-1].loc[:,thyroid_cancer_train.iloc[:,:-1].columns != 'Age'].columns
In [81]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = thyroid_cancer_train_predictors_numeric
In [82]:
##################################
# Segregating the target variable
# and numeric predictors
##################################
histogram_grouping_variable = 'Recurred'
histogram_frequency_variable = numeric_variable_name_list.values[0]
In [83]:
##################################
# Comparing the numeric predictors
# grouped by the target variable
##################################
colors = plt.get_cmap('tab10').colors
plt.figure(figsize=(7, 5))
group_no = thyroid_cancer_train[thyroid_cancer_train[histogram_grouping_variable] == 'No'][histogram_frequency_variable]
group_yes = thyroid_cancer_train[thyroid_cancer_train[histogram_grouping_variable] == 'Yes'][histogram_frequency_variable]
plt.hist(group_no, bins=20, alpha=0.5, color=colors[0], label='No', edgecolor='black')
plt.hist(group_yes, bins=20, alpha=0.5, color=colors[1], label='Yes', edgecolor='black')
plt.title(f'{histogram_grouping_variable} Versus {histogram_frequency_variable}')
plt.xlabel(histogram_frequency_variable)
plt.ylabel('Frequency')
plt.legend()
plt.show()
[Figure: overlaid histograms of Age grouped by Recurred]
In [84]:
##################################
# Performing a general exploration of the categorical variable levels
# based on the ordered categories
##################################
ordered_cat_cols = thyroid_cancer_train.select_dtypes(include=["category"]).columns
for col in ordered_cat_cols:
    print(f"Column: {col}")
    print("Absolute Frequencies:")
    print(thyroid_cancer_train[col].value_counts().reindex(thyroid_cancer_train[col].cat.categories))
    print("\nNormalized Frequencies:")
    print(thyroid_cancer_train[col].value_counts(normalize=True).reindex(thyroid_cancer_train[col].cat.categories))
    print("-" * 50)
    
Column: Gender
Absolute Frequencies:
M     44
F    160
Name: count, dtype: int64

Normalized Frequencies:
M    0.215686
F    0.784314
Name: proportion, dtype: float64
--------------------------------------------------
Column: Smoking
Absolute Frequencies:
No     177
Yes     27
Name: count, dtype: int64

Normalized Frequencies:
No     0.867647
Yes    0.132353
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Smoking
Absolute Frequencies:
No     193
Yes     11
Name: count, dtype: int64

Normalized Frequencies:
No     0.946078
Yes    0.053922
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Radiotherapy
Absolute Frequencies:
No     202
Yes      2
Name: count, dtype: int64

Normalized Frequencies:
No     0.990196
Yes    0.009804
Name: proportion, dtype: float64
--------------------------------------------------
Column: Thyroid_Function
Absolute Frequencies:
Euthyroid                      171
Subclinical Hypothyroidism      10
Subclinical Hyperthyroidism      3
Clinical Hypothyroidism          7
Clinical Hyperthyroidism        13
Name: count, dtype: int64

Normalized Frequencies:
Euthyroid                      0.838235
Subclinical Hypothyroidism     0.049020
Subclinical Hyperthyroidism    0.014706
Clinical Hypothyroidism        0.034314
Clinical Hyperthyroidism       0.063725
Name: proportion, dtype: float64
--------------------------------------------------
Column: Physical_Examination
Absolute Frequencies:
Normal                          4
Single nodular goiter-left     50
Single nodular goiter-right    68
Multinodular goiter            79
Diffuse goiter                  3
Name: count, dtype: int64

Normalized Frequencies:
Normal                         0.019608
Single nodular goiter-left     0.245098
Single nodular goiter-right    0.333333
Multinodular goiter            0.387255
Diffuse goiter                 0.014706
Name: proportion, dtype: float64
--------------------------------------------------
Column: Adenopathy
Absolute Frequencies:
No           144
Left          14
Right         21
Bilateral     19
Posterior      2
Extensive      4
Name: count, dtype: int64

Normalized Frequencies:
No           0.705882
Left         0.068627
Right        0.102941
Bilateral    0.093137
Posterior    0.009804
Extensive    0.019608
Name: proportion, dtype: float64
--------------------------------------------------
Column: Pathology
Absolute Frequencies:
Hurthle Cell       15
Follicular         14
Micropapillary     26
Papillary         149
Name: count, dtype: int64

Normalized Frequencies:
Hurthle Cell      0.073529
Follicular        0.068627
Micropapillary    0.127451
Papillary         0.730392
Name: proportion, dtype: float64
--------------------------------------------------
Column: Focality
Absolute Frequencies:
Uni-Focal      129
Multi-Focal     75
Name: count, dtype: int64

Normalized Frequencies:
Uni-Focal      0.632353
Multi-Focal    0.367647
Name: proportion, dtype: float64
--------------------------------------------------
Column: Risk
Absolute Frequencies:
Low             127
Intermediate     60
High             17
Name: count, dtype: int64

Normalized Frequencies:
Low             0.622549
Intermediate    0.294118
High            0.083333
Name: proportion, dtype: float64
--------------------------------------------------
Column: T
Absolute Frequencies:
T1a    26
T1b    21
T2     73
T3a    58
T3b    10
T4a    12
T4b     4
Name: count, dtype: int64

Normalized Frequencies:
T1a    0.127451
T1b    0.102941
T2     0.357843
T3a    0.284314
T3b    0.049020
T4a    0.058824
T4b    0.019608
Name: proportion, dtype: float64
--------------------------------------------------
Column: N
Absolute Frequencies:
N0     139
N1a     11
N1b     54
Name: count, dtype: int64

Normalized Frequencies:
N0     0.681373
N1a    0.053922
N1b    0.264706
Name: proportion, dtype: float64
--------------------------------------------------
Column: M
Absolute Frequencies:
M0    194
M1     10
Name: count, dtype: int64

Normalized Frequencies:
M0    0.95098
M1    0.04902
Name: proportion, dtype: float64
--------------------------------------------------
Column: Stage
Absolute Frequencies:
I      174
II      21
III      2
IVA      2
IVB      5
Name: count, dtype: int64

Normalized Frequencies:
I      0.852941
II     0.102941
III    0.009804
IVA    0.009804
IVB    0.024510
Name: proportion, dtype: float64
--------------------------------------------------
Column: Response
Absolute Frequencies:
Excellent                 109
Structural Incomplete      53
Biochemical Incomplete      8
Indeterminate              34
Name: count, dtype: int64

Normalized Frequencies:
Excellent                 0.534314
Structural Incomplete     0.259804
Biochemical Incomplete    0.039216
Indeterminate             0.166667
Name: proportion, dtype: float64
--------------------------------------------------
Column: Recurred
Absolute Frequencies:
No     143
Yes     61
Name: count, dtype: int64

Normalized Frequencies:
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
--------------------------------------------------
In [85]:
##################################
# Segregating the target variable
# and categorical predictors
##################################
proportion_y_variables = thyroid_cancer_train_predictors_categorical
proportion_x_variable = 'Recurred'
In [86]:
##################################
# Defining the number of 
# rows and columns for the subplots
##################################
num_rows = 5
num_cols = 3

##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 25))

##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()

##################################
# Formulating the individual stacked column plots
# for all categorical columns
##################################
for i, y_variable in enumerate(proportion_y_variables):
    ax = axes[i]
    category_counts = thyroid_cancer_train.groupby([proportion_x_variable, y_variable], observed=True).size().unstack(fill_value=0)
    category_proportions = category_counts.div(category_counts.sum(axis=1), axis=0)
    category_proportions.plot(kind='bar', stacked=True, ax=ax)
    ax.set_title(f'{proportion_x_variable} Versus {y_variable}')
    ax.set_xlabel(proportion_x_variable)
    ax.set_ylabel('Proportions')
    ax.legend(loc="lower center")

##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()

##################################
# Presenting the subplots
##################################
plt.show()
[Figure: stacked bar charts of categorical predictor proportions by Recurred]
In [87]:
##################################
# Removing predictors observed with extreme
# near-zero variance and a limited number of levels
##################################
thyroid_cancer_train_column_filtered = thyroid_cancer_train.drop(columns=['Hx_Radiotherapy','M','Hx_Smoking'])
thyroid_cancer_train_column_filtered.head()
Out[87]:
Age Gender Smoking Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N Stage Response Recurred
140 28 F No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 I Excellent No
205 36 F No Euthyroid Single nodular goiter-right Right Papillary Uni-Focal Low T2 N1b I Indeterminate No
277 41 M Yes Euthyroid Single nodular goiter-right No Hurthle Cell Multi-Focal Intermediate T3a N0 I Excellent No
294 42 M No Subclinical Hypothyroidism Single nodular goiter-right No Papillary Multi-Focal Intermediate T3a N1a I Indeterminate No
268 32 F No Euthyroid Single nodular goiter-left No Papillary Uni-Focal Low T3a N0 I Excellent No

1.4.3 Category Aggregation and Encoding ¶

  1. Category aggregation was applied to the previously identified high-cardinality categorical predictors whose levels contained only a few observations, to improve model stability during cross-validation and enhance generalization:
    • Thyroid_Function:
      • 171 Thyroid_Function=Euthyroid: 83.82%
      • 33 Thyroid_Function=Hypothyroidism or Hyperthyroidism: 16.18%
    • Physical_Examination:
      • 122 Physical_Examination=Normal or Single Nodular Goiter: 59.80%
      • 82 Physical_Examination=Multinodular or Diffuse Goiter: 40.20%
    • Adenopathy:
      • 144 Adenopathy=No: 70.59%
      • 60 Adenopathy=Yes: 29.41%
    • Pathology:
      • 29 Pathology=Non-Papillary: 14.22%
      • 175 Pathology=Papillary: 85.78%
    • Risk:
      • 127 Risk=Low: 62.25%
      • 77 Risk=Intermediate to High: 37.75%
    • T:
      • 120 T=T1 to T2: 58.82%
      • 84 T=T3 to T4b: 41.18%
    • N:
      • 139 N=N0: 68.14%
      • 65 N=N1: 31.86%
    • Stage:
      • 174 Stage=I: 85.29%
      • 30 Stage=II to IVB: 14.71%
    • Response:
      • 109 Response=Excellent: 53.43%
      • 95 Response=Indeterminate or Incomplete: 46.57%
In [88]:
##################################
# Merging small categories into broader groups 
# for certain categorical predictors
# to ensure sufficient representation in statistical models 
# and prevent sparsity issues in cross-validation
##################################
thyroid_cancer_train_column_filtered['Thyroid_Function'] = thyroid_cancer_train_column_filtered['Thyroid_Function'].map(lambda x: 'Euthyroid' if (x in ['Euthyroid'])  else 'Hypothyroidism or Hyperthyroidism').astype('category')
thyroid_cancer_train_column_filtered['Physical_Examination'] = thyroid_cancer_train_column_filtered['Physical_Examination'].map(lambda x: 'Normal or Single Nodular Goiter' if (x in ['Normal', 'Single nodular goiter-left', 'Single nodular goiter-right'])  else 'Multinodular or Diffuse Goiter').astype('category')
thyroid_cancer_train_column_filtered['Adenopathy'] = thyroid_cancer_train_column_filtered['Adenopathy'].map(lambda x: 'No' if x == 'No' else ('Yes' if pd.notna(x) and x != '' else x)).astype('category')
thyroid_cancer_train_column_filtered['Pathology'] = thyroid_cancer_train_column_filtered['Pathology'].map(lambda x: 'Non-Papillary' if (x in ['Hurthle Cell', 'Follicular'])  else 'Papillary').astype('category')
thyroid_cancer_train_column_filtered['Risk'] = thyroid_cancer_train_column_filtered['Risk'].map(lambda x: 'Low' if (x in ['Low'])  else 'Intermediate to High').astype('category')
thyroid_cancer_train_column_filtered['T'] = thyroid_cancer_train_column_filtered['T'].map(lambda x: 'T1 to T2' if (x in ['T1a', 'T1b', 'T2'])  else 'T3 to T4b').astype('category')
thyroid_cancer_train_column_filtered['N'] = thyroid_cancer_train_column_filtered['N'].map(lambda x: 'N0' if (x in ['N0'])  else 'N1').astype('category')
thyroid_cancer_train_column_filtered['Stage'] = thyroid_cancer_train_column_filtered['Stage'].map(lambda x: 'I' if (x in ['I'])  else 'II to IVB').astype('category')
thyroid_cancer_train_column_filtered['Response'] = thyroid_cancer_train_column_filtered['Response'].map(lambda x: 'Indeterminate or Incomplete' if (x in ['Indeterminate', 'Structural Incomplete', 'Biochemical Incomplete'])  else 'Excellent').astype('category')
thyroid_cancer_train_column_filtered.head()
Out[88]:
Age Gender Smoking Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N Stage Response Recurred
140 28 F No Euthyroid Multinodular or Diffuse Goiter No Papillary Uni-Focal Low T1 to T2 N0 I Excellent No
205 36 F No Euthyroid Normal or Single Nodular Goiter Yes Papillary Uni-Focal Low T1 to T2 N1 I Indeterminate or Incomplete No
277 41 M Yes Euthyroid Normal or Single Nodular Goiter No Non-Papillary Multi-Focal Intermediate to High T3 to T4b N0 I Excellent No
294 42 M No Hypothyroidism or Hyperthyroidism Normal or Single Nodular Goiter No Papillary Multi-Focal Intermediate to High T3 to T4b N1 I Indeterminate or Incomplete No
268 32 F No Euthyroid Normal or Single Nodular Goiter No Papillary Uni-Focal Low T3 to T4b N0 I Excellent No
In [89]:
##################################
# Performing a general exploration of the categorical variable levels
# based on the ordered categories
##################################
ordered_cat_cols = thyroid_cancer_train_column_filtered.select_dtypes(include=["category"]).columns
for col in ordered_cat_cols:
    print(f"Column: {col}")
    print("Absolute Frequencies:")
    print(thyroid_cancer_train_column_filtered[col].value_counts().reindex(thyroid_cancer_train_column_filtered[col].cat.categories))
    print("\nNormalized Frequencies:")
    print(thyroid_cancer_train_column_filtered[col].value_counts(normalize=True).reindex(thyroid_cancer_train_column_filtered[col].cat.categories))
    print("-" * 50)
    
Column: Gender
Absolute Frequencies:
M     44
F    160
Name: count, dtype: int64

Normalized Frequencies:
M    0.215686
F    0.784314
Name: proportion, dtype: float64
--------------------------------------------------
Column: Smoking
Absolute Frequencies:
No     177
Yes     27
Name: count, dtype: int64

Normalized Frequencies:
No     0.867647
Yes    0.132353
Name: proportion, dtype: float64
--------------------------------------------------
Column: Thyroid_Function
Absolute Frequencies:
Euthyroid                            171
Hypothyroidism or Hyperthyroidism     33
Name: count, dtype: int64

Normalized Frequencies:
Euthyroid                            0.838235
Hypothyroidism or Hyperthyroidism    0.161765
Name: proportion, dtype: float64
--------------------------------------------------
Column: Physical_Examination
Absolute Frequencies:
Multinodular or Diffuse Goiter      82
Normal or Single Nodular Goiter    122
Name: count, dtype: int64

Normalized Frequencies:
Multinodular or Diffuse Goiter     0.401961
Normal or Single Nodular Goiter    0.598039
Name: proportion, dtype: float64
--------------------------------------------------
Column: Adenopathy
Absolute Frequencies:
No     144
Yes     60
Name: count, dtype: int64

Normalized Frequencies:
No     0.705882
Yes    0.294118
Name: proportion, dtype: float64
--------------------------------------------------
Column: Pathology
Absolute Frequencies:
Non-Papillary     29
Papillary        175
Name: count, dtype: int64

Normalized Frequencies:
Non-Papillary    0.142157
Papillary        0.857843
Name: proportion, dtype: float64
--------------------------------------------------
Column: Focality
Absolute Frequencies:
Uni-Focal      129
Multi-Focal     75
Name: count, dtype: int64

Normalized Frequencies:
Uni-Focal      0.632353
Multi-Focal    0.367647
Name: proportion, dtype: float64
--------------------------------------------------
Column: Risk
Absolute Frequencies:
Intermediate to High     77
Low                     127
Name: count, dtype: int64

Normalized Frequencies:
Intermediate to High    0.377451
Low                     0.622549
Name: proportion, dtype: float64
--------------------------------------------------
Column: T
Absolute Frequencies:
T1 to T2     120
T3 to T4b     84
Name: count, dtype: int64

Normalized Frequencies:
T1 to T2     0.588235
T3 to T4b    0.411765
Name: proportion, dtype: float64
--------------------------------------------------
Column: N
Absolute Frequencies:
N0    139
N1     65
Name: count, dtype: int64

Normalized Frequencies:
N0    0.681373
N1    0.318627
Name: proportion, dtype: float64
--------------------------------------------------
Column: Stage
Absolute Frequencies:
I            174
II to IVB     30
Name: count, dtype: int64

Normalized Frequencies:
I            0.852941
II to IVB    0.147059
Name: proportion, dtype: float64
--------------------------------------------------
Column: Response
Absolute Frequencies:
Excellent                      109
Indeterminate or Incomplete     95
Name: count, dtype: int64

Normalized Frequencies:
Excellent                      0.534314
Indeterminate or Incomplete    0.465686
Name: proportion, dtype: float64
--------------------------------------------------
Column: Recurred
Absolute Frequencies:
No     143
Yes     61
Name: count, dtype: int64

Normalized Frequencies:
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
--------------------------------------------------
In [90]:
##################################
# Segregating the target
# and predictor variables
##################################
thyroid_cancer_train_predictors = thyroid_cancer_train_column_filtered.iloc[:,:-1].columns
thyroid_cancer_train_predictors_numeric = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:, thyroid_cancer_train_column_filtered.iloc[:,:-1].columns == 'Age'].columns
thyroid_cancer_train_predictors_categorical = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:,thyroid_cancer_train_column_filtered.iloc[:,:-1].columns != 'Age'].columns
In [91]:
##################################
# Segregating the target variable
# and categorical predictors
##################################
proportion_y_variables = thyroid_cancer_train_predictors_categorical
proportion_x_variable = 'Recurred'
In [92]:
##################################
# Defining the number of 
# rows and columns for the subplots
##################################
num_rows = 4
num_cols = 3

##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 20))

##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()

##################################
# Formulating the individual stacked column plots
# for all categorical columns
##################################
for i, y_variable in enumerate(proportion_y_variables):
    ax = axes[i]
    category_counts = thyroid_cancer_train_column_filtered.groupby([proportion_x_variable, y_variable], observed=True).size().unstack(fill_value=0)
    category_proportions = category_counts.div(category_counts.sum(axis=1), axis=0)
    category_proportions.plot(kind='bar', stacked=True, ax=ax)
    ax.set_title(f'{proportion_x_variable} Versus {y_variable}')
    ax.set_xlabel(proportion_x_variable)
    ax.set_ylabel('Proportions')
    ax.legend(loc="lower center")

##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()

##################################
# Presenting the subplots
##################################
plt.show()
[Figure: stacked bar charts of aggregated categorical predictor proportions by Recurred]

1.4.4 Outlier and Distributional Shape Analysis ¶

  1. No outliers (Outlier.Count>0, Outlier.Ratio>0.000), high skewness (Skewness>3 or Skewness<(-3)) or abnormal kurtosis (Kurtosis>2 or Kurtosis<(-2)) were observed for the numeric predictor (see the note on pandas' excess kurtosis in the sketch after this list).
    • Age: Outlier.Count = 0, Outlier.Ratio = 0.000, Skewness = 0.525, Kurtosis = -0.494
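For reference, pandas' skew() and kurtosis() report Fisher (excess) statistics, so a normal distribution scores near zero on both measures; the thresholds above are therefore applied on the excess scale, consistent with the reported Age kurtosis of -0.494. A minimal illustrative sketch:
##################################
# Demonstrating that pandas reports
# excess (Fisher) kurtosis
# (illustrative sketch only)
##################################
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
normal_sample = pd.Series(rng.normal(size=100_000))
print(f"Skewness: {normal_sample.skew():.3f}")      # close to 0
print(f"Kurtosis: {normal_sample.kurtosis():.3f}")  # close to 0 on the excess scale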
In [93]:
##################################
# Formulating the imputed dataset
# with numeric columns only
##################################
thyroid_cancer_train_column_filtered['Age'] = pd.to_numeric(thyroid_cancer_train_column_filtered['Age'])
thyroid_cancer_train_column_filtered_numeric = thyroid_cancer_train_column_filtered.select_dtypes(include='number')
thyroid_cancer_train_column_filtered_numeric = thyroid_cancer_train_column_filtered_numeric.to_frame() if isinstance(thyroid_cancer_train_column_filtered_numeric, pd.Series) else thyroid_cancer_train_column_filtered_numeric
In [94]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = list(thyroid_cancer_train_column_filtered_numeric.columns)
In [95]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = thyroid_cancer_train_column_filtered_numeric.skew()
In [96]:
##################################
# Computing the interquartile range
# for all columns
##################################
thyroid_cancer_train_column_filtered_numeric_q1 = thyroid_cancer_train_column_filtered_numeric.quantile(0.25)
thyroid_cancer_train_column_filtered_numeric_q3 = thyroid_cancer_train_column_filtered_numeric.quantile(0.75)
thyroid_cancer_train_column_filtered_numeric_iqr = thyroid_cancer_train_column_filtered_numeric_q3 - thyroid_cancer_train_column_filtered_numeric_q1
In [97]:
##################################
# Gathering the outlier count for each numeric column
# based on the interquartile range criterion
##################################
numeric_outlier_count_list = ((thyroid_cancer_train_column_filtered_numeric < (thyroid_cancer_train_column_filtered_numeric_q1 - 1.5 * thyroid_cancer_train_column_filtered_numeric_iqr)) | (thyroid_cancer_train_column_filtered_numeric > (thyroid_cancer_train_column_filtered_numeric_q3 + 1.5 * thyroid_cancer_train_column_filtered_numeric_iqr))).sum() 
In [98]:
##################################
# Gathering the number of observations for each column
##################################
numeric_row_count_list = list([len(thyroid_cancer_train_column_filtered_numeric)] * len(thyroid_cancer_train_column_filtered_numeric.columns))
In [99]:
##################################
# Gathering the outlier to count ratio for each numeric column
##################################
numeric_outlier_ratio_list = list(map(truediv, numeric_outlier_count_list, numeric_row_count_list))
In [101]:
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = thyroid_cancer_train_column_filtered_numeric.kurtosis()
In [102]:
##################################
# Formulating the outlier summary
# for all numeric columns
##################################
numeric_column_outlier_summary = pd.DataFrame(zip(numeric_variable_name_list,
                                                  numeric_outlier_count_list,
                                                  numeric_row_count_list,
                                                  numeric_outlier_ratio_list,
                                                  numeric_skewness_list,
                                                  numeric_kurtosis_list), 
                                        columns=['Numeric.Column.Name',
                                                 'Outlier.Count',
                                                 'Row.Count',
                                                 'Outlier.Ratio',
                                                 'Skewness',
                                                 'Kurtosis'])
display(numeric_column_outlier_summary)
Numeric.Column.Name Outlier.Count Row.Count Outlier.Ratio Skewness Kurtosis
0 Age 0 204 0.0 0.525218 -0.494286
In [103]:
##################################
# Formulating the individual boxplots
# for all numeric columns
##################################
for column in thyroid_cancer_train_column_filtered_numeric:
    plt.figure(figsize=(17,1))
    sns.boxplot(data=thyroid_cancer_train_column_filtered_numeric, x=column)
[Figure: boxplot of Age]

1.4.5 Collinearity ¶

  1. The majority of the predictors reported low (<0.50) to moderate (0.50 to 0.75) correlations.
  2. Among pairwise combinations of categorical predictors, high Phi.Coefficient values were noted for the following pairs (a minimal sketch of the Phi coefficient computation follows this list):
    • N and Adenopathy: Phi.Coefficient = +0.805
    • N and Risk: Phi.Coefficient = +0.726
    • Adenopathy and Risk: Phi.Coefficient = +0.674
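For two binary variables coded 0/1, the Phi coefficient is numerically equivalent to the Pearson correlation, which is how the correlation loop below obtains it; it can also be computed directly from the 2x2 contingency table as (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)). A minimal sketch with hypothetical toy vectors (illustrative only):
##################################
# Phi coefficient for two binary variables:
# equivalent to the Pearson correlation on 0/1 codes
# (illustrative sketch with toy data)
##################################
import numpy as np
import pandas as pd

x = pd.Series([0, 0, 1, 1, 1, 0, 1, 0])
y = pd.Series([0, 1, 1, 1, 0, 0, 1, 0])

# Pearson correlation on the 0/1 codes
phi_pearson = x.corr(y)

# Same value recovered from the 2x2 contingency table
table = pd.crosstab(x, y).to_numpy()
a, b = table[0]
c, d = table[1]
phi_table = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(f"{phi_pearson:.6f} {phi_table:.6f}")  # both print 0.500000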
In [104]:
##################################
# Creating a dataset copy and
# converting all values to numeric
# for correlation analysis
##################################
pd.set_option('future.no_silent_downcasting', True)
thyroid_cancer_train_correlation = thyroid_cancer_train_column_filtered.copy()
thyroid_cancer_train_correlation_object = thyroid_cancer_train_correlation.iloc[:,1:13].columns
custom_category_orders = {
    'Gender': ['M', 'F'],  
    'Smoking': ['No', 'Yes'],  
    'Thyroid_Function': ['Euthyroid', 'Hypothyroidism or Hyperthyroidism'],  
    'Physical_Examination': ['Normal or Single Nodular Goiter', 'Multinodular or Diffuse Goiter'],  
    'Adenopathy': ['No', 'Yes'],  
    'Pathology': ['Non-Papillary', 'Papillary'],  
    'Focality': ['Uni-Focal', 'Multi-Focal'],  
    'Risk': ['Low', 'Intermediate to High'],  
    'T': ['T1 to T2', 'T3 to T4b'],  
    'N': ['N0', 'N1'],  
    'Stage': ['I', 'II to IVB'],  
    'Response': ['Excellent', 'Indeterminate or Incomplete'] 
}
encoder = OrdinalEncoder(categories=[custom_category_orders[col] for col in thyroid_cancer_train_correlation_object])
thyroid_cancer_train_correlation[thyroid_cancer_train_correlation_object] = encoder.fit_transform(
    thyroid_cancer_train_correlation[thyroid_cancer_train_correlation_object]
)
thyroid_cancer_train_correlation = thyroid_cancer_train_correlation.drop(['Recurred'], axis=1)
display(thyroid_cancer_train_correlation)
Age Gender Smoking Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N Stage Response
140 28 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
205 36 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0
277 41 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0
294 42 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0
268 32 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
300 67 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 1.0
115 37 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
67 51 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
161 22 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
55 21 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0

204 rows × 13 columns

In [106]:
##################################
# Creating an empty correlation matrix
##################################
thyroid_cancer_train_correlation_matrix = pd.DataFrame(
    np.zeros((len(thyroid_cancer_train_correlation.columns), len(thyroid_cancer_train_correlation.columns))),
    index=thyroid_cancer_train_correlation.columns,
    columns=thyroid_cancer_train_correlation.columns
)


##################################
# Calculating different types
# of correlation coefficients
# per variable type
##################################
for i in range(len(thyroid_cancer_train_correlation.columns)):
    for j in range(i, len(thyroid_cancer_train_correlation.columns)):
        if i == j:
            thyroid_cancer_train_correlation_matrix.iloc[i, j] = 1.0  
        else:
            col_i = thyroid_cancer_train_correlation.iloc[:, i]
            col_j = thyroid_cancer_train_correlation.iloc[:, j]

            # Detecting binary variables (assumes binary variables are coded as 0/1)
            is_binary_i = col_i.nunique() == 2
            is_binary_j = col_j.nunique() == 2

            # Computing the Pearson correlation for two continuous variables
            if col_i.dtype in ['int64', 'float64'] and col_j.dtype in ['int64', 'float64']:
                corr = col_i.corr(col_j)

            # Computing the Point-Biserial correlation for continuous and binary variables
            elif (col_i.dtype in ['int64', 'float64'] and is_binary_j) or (col_j.dtype in ['int64', 'float64'] and is_binary_i):
                continuous_var = col_i if col_i.dtype in ['int64', 'float64'] else col_j
                binary_var = col_j if is_binary_j else col_i

                # Convert binary variable to 0/1 (if not already)
                binary_var = binary_var.astype('category').cat.codes
                corr, _ = pointbiserialr(continuous_var, binary_var)

            # Computing the Phi coefficient for two binary variables
            elif is_binary_i and is_binary_j:
                corr = col_i.corr(col_j) 

            # Computing the Cramér's V for two categorical variables (if more than 2 categories)
            else:
                contingency_table = pd.crosstab(col_i, col_j)
                chi2, _, _, _ = chi2_contingency(contingency_table)
                n = contingency_table.sum().sum()
                phi2 = chi2 / n
                r, k = contingency_table.shape
                corr = np.sqrt(phi2 / min(k - 1, r - 1))  # Cramér's V formula

            # Assigning correlation values to the matrix
            thyroid_cancer_train_correlation_matrix.iloc[i, j] = corr
            thyroid_cancer_train_correlation_matrix.iloc[j, i] = corr

# Displaying the correlation matrix
display(thyroid_cancer_train_correlation_matrix)
            
Age Gender Smoking Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N Stage Response
Age 1.000000 -0.185530 0.299971 0.077845 0.012021 0.073931 -0.215274 0.195272 0.205360 0.246838 0.013195 0.528144 0.317978
Gender -0.185530 1.000000 -0.604101 -0.093290 -0.031935 -0.158480 0.127817 -0.218103 -0.255507 -0.215101 -0.178550 -0.219727 -0.179431
Smoking 0.299971 -0.604101 1.000000 0.064124 0.004339 0.192350 -0.338086 0.182212 0.233024 0.231679 0.105463 0.327952 0.215362
Thyroid_Function 0.077845 -0.093290 0.064124 1.000000 0.019964 -0.137486 -0.049893 0.051564 -0.012519 -0.042960 -0.043275 0.080702 -0.036498
Physical_Examination 0.012021 -0.031935 0.004339 0.019964 1.000000 0.063246 0.018806 0.245779 0.166012 0.086039 0.104553 0.054799 0.116526
Adenopathy 0.073931 -0.158480 0.192350 -0.137486 0.063246 1.000000 0.047117 0.288750 0.673638 0.421762 0.805406 0.278749 0.518887
Pathology -0.215274 0.127817 -0.338086 -0.049893 0.018806 0.047117 1.000000 -0.126299 -0.117392 -0.286899 0.157869 -0.187683 -0.154637
Focality 0.195272 -0.218103 0.182212 0.051564 0.245779 0.288750 -0.126299 1.000000 0.454926 0.518864 0.307716 0.372331 0.388741
Risk 0.205360 -0.255507 0.233024 -0.012519 0.166012 0.673638 -0.117392 0.454926 1.000000 0.622459 0.726304 0.533264 0.631330
T 0.246838 -0.215101 0.231679 -0.042960 0.086039 0.421762 -0.286899 0.518864 0.622459 1.000000 0.368430 0.468168 0.556742
N 0.013195 -0.178550 0.105463 -0.043275 0.104553 0.805406 0.157869 0.307716 0.726304 0.368430 1.000000 0.310156 0.542672
Stage 0.528144 -0.219727 0.327952 0.080702 0.054799 0.278749 -0.187683 0.372331 0.533264 0.468168 0.310156 1.000000 0.417025
Response 0.317978 -0.179431 0.215362 -0.036498 0.116526 0.518887 -0.154637 0.388741 0.631330 0.556742 0.542672 0.417025 1.000000
In [107]:
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric and categorical columns
##################################
plt.figure(figsize=(17, 8))
sns.heatmap(thyroid_cancer_train_correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
[Figure: correlation matrix heatmap for all pairwise predictor combinations]

1.5. Data Exploration ¶

1.5.1 Exploratory Data Analysis ¶

  1. Bivariate analysis identified individual predictors with a generally positive association with the target variable based on visual inspection.
  2. Higher values or higher proportions for the following predictors are associated with the Recurred=Yes category:
    • Age
    • Gender=M
    • Smoking=Yes
    • Physical_Examination=Multinodular or Diffuse Goiter
    • Adenopathy=Yes
    • Focality=Multi-Focal
    • Risk=Intermediate to High
    • T=T3 to T4b
    • N=N1
    • Stage=II to IVB
    • Response=Indeterminate or Incomplete
  3. Proportions for the following predictors showed no clear association with the Recurred=Yes or Recurred=No categories:
    • Thyroid_Function
    • Pathology
In [108]:
##################################
# Segregating the target
# and predictor variables
##################################
thyroid_cancer_train_column_filtered_predictors = thyroid_cancer_train_column_filtered.iloc[:,:-1].columns
thyroid_cancer_train_column_filtered_predictors_numeric = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:, thyroid_cancer_train_column_filtered.iloc[:,:-1].columns == 'Age'].columns
thyroid_cancer_train_column_filtered_predictors_categorical = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:,thyroid_cancer_train_column_filtered.iloc[:,:-1].columns != 'Age'].columns
In [109]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = thyroid_cancer_train_column_filtered_predictors_numeric
In [110]:
##################################
# Segregating the target variable
# and numeric predictors
##################################
boxplot_y_variable = 'Recurred'
boxplot_x_variable = numeric_variable_name_list.values[0]
In [111]:
##################################
# Evaluating the numeric predictors
# against the target variable
##################################
plt.figure(figsize=(7, 5))
plt.boxplot([group[boxplot_x_variable] for name, group in thyroid_cancer_train_column_filtered.groupby(boxplot_y_variable, observed=True)])
plt.title(f'{boxplot_y_variable} Versus {boxplot_x_variable}')
plt.xlabel(boxplot_y_variable)
plt.ylabel(boxplot_x_variable)
plt.xticks(range(1, len(thyroid_cancer_train_column_filtered[boxplot_y_variable].unique()) + 1), ['No', 'Yes'])
plt.show()
[Figure: boxplots of Age grouped by Recurred]
In [112]:
##################################
# Segregating the target variable
# and categorical predictors
##################################
proportion_y_variables = thyroid_cancer_train_column_filtered_predictors_categorical
proportion_x_variable = 'Recurred'
In [113]:
##################################
# Defining the number of 
# rows and columns for the subplots
##################################
num_rows = 4
num_cols = 3

##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 20))

##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()

##################################
# Formulating the individual stacked column plots
# for all categorical columns
##################################
for i, y_variable in enumerate(proportion_y_variables):
    ax = axes[i]
    category_counts = thyroid_cancer_train_column_filtered.groupby([proportion_x_variable, y_variable], observed=True).size().unstack(fill_value=0)
    category_proportions = category_counts.div(category_counts.sum(axis=1), axis=0)
    category_proportions.plot(kind='bar', stacked=True, ax=ax)
    ax.set_title(f'{proportion_x_variable} Versus {y_variable}')
    ax.set_xlabel(proportion_x_variable)
    ax.set_ylabel('Proportions')
    ax.legend(loc="lower center")

##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()

##################################
# Presenting the subplots
##################################
plt.show()
[Figure: stacked bar charts of categorical predictor proportions by Recurred]

1.5.2 Hypothesis Testing ¶

  1. The relationship between the numeric predictor and the Recurred target variable was statistically evaluated using the following hypotheses:
    • Null: Difference in the means between groups Yes and No is equal to zero
    • Alternative: Difference in the means between groups Yes and No is not equal to zero
  2. There is sufficient evidence to conclude that there is a statistically significant difference between the means of the numeric measurements obtained from the Yes and No groups of the Recurred target variable in 1 of 1 numeric predictor, given its high t-test statistic value with a reported low p-value less than the significance level of 0.05.
    • Age: T.Test.Statistic=-3.748, T.Test.PValue=0.000
  3. The relationship between the categorical predictors and the Recurred target variable was statistically evaluated using the following hypotheses:
    • Null: The categorical predictor is independent of the categorical target variable
    • Alternative: The categorical predictor is dependent on the categorical target variable
  4. There is sufficient evidence to conclude that there is a statistically significant relationship between the categories of the categorical predictors and the Yes and No groups of the Recurred target variable in 9 of 12 categorical predictors, given their high chi-square statistic values with reported low p-values less than the significance level of 0.05 (a worked example of the chi-square computation follows this list).
    • Risk: ChiSquare.Test.Statistic=98.599, ChiSquare.Test.PValue=0.000
    • Response: ChiSquare.Test.Statistic=90.866, ChiSquare.Test.PValue=0.000
    • Adenopathy: ChiSquare.Test.Statistic=73.585, ChiSquare.Test.PValue=0.000
    • N: ChiSquare.Test.Statistic=73.176, ChiSquare.Test.PValue=0.000
    • T: ChiSquare.Test.Statistic=62.205, ChiSquare.Test.PValue=0.000
    • Stage: ChiSquare.Test.Statistic=44.963, ChiSquare.Test.PValue=0.000
    • Focality: ChiSquare.Test.Statistic=32.859, ChiSquare.Test.PValue=0.000
    • Gender: ChiSquare.Test.Statistic=17.787, ChiSquare.Test.PValue=0.000
    • Smoking: ChiSquare.Test.Statistic=14.460, ChiSquare.Test.PValue=0.001
  5. There is marginal evidence of a statistically significant relationship between the categories of the categorical predictors and the Yes and No groups of the Recurred target variable in 1 of 12 categorical predictors, given its moderately high chi-square statistic value with a reported p-value near the significance level of 0.10.
    • Physical_Examination: ChiSquare.Test.Statistic=2.413, ChiSquare.Test.PValue=0.120
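To make the reported statistics concrete, the sketch below recomputes the chi-square statistic by hand for a hypothetical 2x2 contingency table and compares it against scipy.stats.chi2_contingency. This is an illustrative example only; note that scipy applies Yates' continuity correction to 2x2 tables by default, so correction=False is set here to match the uncorrected formula.
##################################
# Recomputing the chi-square statistic by hand
# for a toy 2x2 contingency table
# (illustrative sketch only)
##################################
import numpy as np
from scipy import stats

observed = np.array([[90, 30],
                     [40, 44]])

# Expected counts under independence:
# row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()

# correction=False disables Yates' continuity correction,
# which scipy applies to 2x2 tables by default
chi2_scipy, p_value, dof, _ = stats.chi2_contingency(observed, correction=False)
print(f"manual={chi2_manual:.3f} scipy={chi2_scipy:.3f} p={p_value:.5f}")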
In [114]:
##################################
# Computing the t-test 
# statistic and p-values
# between the target variable
# and numeric predictor columns
##################################
thyroid_cancer_numeric_ttest_target = {}
thyroid_cancer_numeric = thyroid_cancer_train_column_filtered.loc[:,(thyroid_cancer_train_column_filtered.columns == 'Age') | (thyroid_cancer_train_column_filtered.columns == 'Recurred')]
thyroid_cancer_numeric_columns = thyroid_cancer_train_column_filtered_predictors_numeric
for numeric_column in thyroid_cancer_numeric_columns:
    group_0 = thyroid_cancer_numeric[thyroid_cancer_numeric.loc[:,'Recurred']=='No']
    group_1 = thyroid_cancer_numeric[thyroid_cancer_numeric.loc[:,'Recurred']=='Yes']
    thyroid_cancer_numeric_ttest_target['Recurred_' + numeric_column] = stats.ttest_ind(
        group_0[numeric_column], 
        group_1[numeric_column], 
        equal_var=True)
In [115]:
##################################
# Formulating the pairwise ttest summary
# between the target variable
# and numeric predictor columns
##################################
thyroid_cancer_numeric_summary = pd.DataFrame.from_dict(thyroid_cancer_numeric_ttest_target, orient='index')
thyroid_cancer_numeric_summary.columns = ['T.Test.Statistic', 'T.Test.PValue']
display(thyroid_cancer_numeric_summary.sort_values(by=['T.Test.PValue'], ascending=True).head(len(thyroid_cancer_train_column_filtered_predictors_numeric)))
T.Test.Statistic T.Test.PValue
Recurred_Age -3.747942 0.000233
In [116]:
##################################
# Computing the chisquare
# statistic and p-values
# between the target variable
# and categorical predictor columns
##################################
thyroid_cancer_categorical_chisquare_target = {}
thyroid_cancer_categorical = thyroid_cancer_train_column_filtered.loc[:,(thyroid_cancer_train_column_filtered.columns != 'Age') | (thyroid_cancer_train_column_filtered.columns == 'Recurred')]
thyroid_cancer_categorical_columns = thyroid_cancer_train_column_filtered_predictors_categorical
for categorical_column in thyroid_cancer_categorical_columns:
    contingency_table = pd.crosstab(thyroid_cancer_categorical[categorical_column], 
                                    thyroid_cancer_categorical['Recurred'])
    thyroid_cancer_categorical_chisquare_target['Recurred_' + categorical_column] = stats.chi2_contingency(
        contingency_table)[0:2]
In [117]:
##################################
# Formulating the pairwise chisquare summary
# between the target variable
# and categorical predictor columns
##################################
thyroid_cancer_categorical_summary = pd.DataFrame.from_dict(thyroid_cancer_categorical_chisquare_target, orient='index')
thyroid_cancer_categorical_summary.columns = ['ChiSquare.Test.Statistic', 'ChiSquare.Test.PValue']
display(thyroid_cancer_categorical_summary.sort_values(by=['ChiSquare.Test.PValue'], ascending=True).head(len(thyroid_cancer_train_column_filtered_predictors_categorical)))
ChiSquare.Test.Statistic ChiSquare.Test.PValue
Recurred_Risk 98.599608 3.090804e-23
Recurred_Response 90.866461 1.537030e-21
Recurred_Adenopathy 73.585561 9.636704e-18
Recurred_N 73.176134 1.185810e-17
Recurred_T 62.205367 3.094435e-15
Recurred_Stage 44.963917 2.006987e-11
Recurred_Focality 32.859398 9.907099e-09
Recurred_Gender 17.787641 2.469824e-05
Recurred_Smoking 14.460357 1.431406e-04
Recurred_Physical_Examination 2.413115 1.203227e-01
Recurred_Thyroid_Function 0.966826 3.254729e-01
Recurred_Pathology 0.131614 7.167646e-01

1.6. Premodelling Data Preparation ¶

1.6.1 Preprocessed Data Description¶

  1. A total of 6 of the 16 predictors were excluded from the dataset based on the data preprocessing and exploration findings.
  2. There were 3 categorical predictors excluded from the dataset after having been observed with extremely low variance containing categories with very few or almost no variations across observations that may have limited predictive power or drive increased model complexity without performance gains:
    • Hx_Smoking:
      • 193 Hx_Smoking=No: 94.61%
      • 11 Hx_Smoking=Yes: 5.39%
    • Hx_Radiotherapy:
      • 202 Hx_Radiotherapy=No: 99.02%
      • 2 Hx_Radiotherapy=Yes: 0.98%
    • M:
      • 194 M=M0: 95.10%
      • 10 M=M1: 4.90%
  3. There was 1 categorical predictor excluded from the dataset after having been observed with high pairwise collinearity (Phi.Coefficient>0.70) with 2 other predictors, which might provide redundant information and lead to potential instability in regression models.
    • N and Adenopathy: Phi.Coefficient = +0.805
    • N and Risk: Phi.Coefficient = +0.726
  4. Another 2 categorical predictors were excluded from the dataset for not exhibiting a statistically significant association with the Yes and No groups of the Recurred target variable, indicating weak predictive value.
    • Thyroid_Function: ChiSquare.Test.Statistic=0.967, ChiSquare.Test.PValue=0.325
    • Pathology: ChiSquare.Test.Statistic=0.132, ChiSquare.Test.PValue=0.717
  5. The preprocessed train data (final) subset comprises:
    • 204 rows (observations)
      • 143 Recurred=No: 70.10%
      • 61 Recurred=Yes: 29.90%
    • 11 columns (variables)
      • 1/11 target (categorical)
        • Recurred
      • 1/11 predictor (numeric)
        • Age
      • 9/11 predictor (categorical)
        • Gender
        • Smoking
        • Physical_Examination
        • Adenopathy
        • Focality
        • Risk
        • T
        • Stage
        • Response

1.6.2 Preprocessing Pipeline Development¶

  1. A preprocessing pipeline was formulated and applied to the train data (final), validation data and test data with the following actions:
    • Excluded specified columns noted with low variance, high collinearity and weak predictive power
    • Aggregated categories in multiclass categorical variables into binary levels
    • Converted categorical columns to the appropriate type
    • Set the order of category levels for ordinal encoding during modeling pipeline creation
In [118]:
##################################
# Formulating a preprocessing pipeline
# that removes the specified columns,
# aggregates categories in multiclass categorical variables,
# converts categorical columns to the appropriate type, and
# sets the order of category levels
##################################
def preprocess_dataset(df):
    # Removing the specified columns
    columns_to_remove = ['Hx_Smoking', 'Hx_Radiotherapy', 'M', 'N', 'Thyroid_Function', 'Pathology']
    df = df.drop(columns=columns_to_remove)
    
    # Applying category aggregation
    df['Physical_Examination'] = df['Physical_Examination'].map(
        lambda x: 'Normal or Single Nodular Goiter' if x in ['Normal', 'Single nodular goiter-left', 'Single nodular goiter-right'] 
        else 'Multinodular or Diffuse Goiter').astype('category')
    
    df['Adenopathy'] = df['Adenopathy'].map(
        lambda x: 'No' if x == 'No' else ('Yes' if pd.notna(x) and x != '' else x)).astype('category')
    
    df['Risk'] = df['Risk'].map(
        lambda x: 'Low' if x == 'Low' else 'Intermediate to High').astype('category')
    
    df['T'] = df['T'].map(
        lambda x: 'T1 to T2' if x in ['T1a', 'T1b', 'T2'] else 'T3 to T4b').astype('category')
    
    df['Stage'] = df['Stage'].map(
        lambda x: 'I' if x == 'I' else 'II to IVB').astype('category')
    
    df['Response'] = df['Response'].map(
        lambda x: 'Indeterminate or Incomplete' if x in ['Indeterminate', 'Structural Incomplete', 'Biochemical Incomplete'] 
        else 'Excellent').astype('category')
    
    # Setting category levels
    category_mappings = {
        'Gender': ['M', 'F'],
        'Smoking': ['No', 'Yes'],
        'Physical_Examination': ['Normal or Single Nodular Goiter', 'Multinodular or Diffuse Goiter'],
        'Adenopathy': ['No', 'Yes'],
        'Focality': ['Uni-Focal', 'Multi-Focal'],
        'Risk': ['Low', 'Intermediate to High'],
        'T': ['T1 to T2', 'T3 to T4b'],
        'Stage': ['I', 'II to IVB'],
        'Response': ['Excellent', 'Indeterminate or Incomplete']
    }
    
    for col, categories in category_mappings.items():
        df[col] = df[col].astype('category')
        df[col] = df[col].cat.set_categories(categories, ordered=True)
    
    return df
    
In [119]:
##################################
# Applying the preprocessing pipeline
# to the train data
##################################
thyroid_cancer_preprocessed_train = preprocess_dataset(thyroid_cancer_train)
X_preprocessed_train = thyroid_cancer_preprocessed_train.drop('Recurred', axis = 1)
y_preprocessed_train = thyroid_cancer_preprocessed_train['Recurred']
thyroid_cancer_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_PATH, "thyroid_cancer_preprocessed_train.csv"), index=False)
X_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_FEATURES_PATH, "X_preprocessed_train.csv"), index=False)
y_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_TARGET_PATH, "y_preprocessed_train.csv"), index=False)
print('Final Preprocessed Train Dataset Dimensions: ')
display(X_preprocessed_train.shape)
display(y_preprocessed_train.shape)
print('Final Preprocessed Train Target Variable Breakdown: ')
display(y_preprocessed_train.value_counts())
print('Final Preprocessed Train Target Variable Proportion: ')
display(y_preprocessed_train.value_counts(normalize = True))
thyroid_cancer_preprocessed_train.head()
Final Preprocessed Train Dataset Dimensions: 
(204, 10)
(204,)
Final Preprocessed Train Target Variable Breakdown: 
Recurred
No     143
Yes     61
Name: count, dtype: int64
Final Preprocessed Train Target Variable Proportion: 
Recurred
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
Out[119]:
Age Gender Smoking Physical_Examination Adenopathy Focality Risk T Stage Response Recurred
140 28 F No Multinodular or Diffuse Goiter No Uni-Focal Low T1 to T2 I Excellent No
205 36 F No Normal or Single Nodular Goiter Yes Uni-Focal Low T1 to T2 I Indeterminate or Incomplete No
277 41 M Yes Normal or Single Nodular Goiter No Multi-Focal Intermediate to High T3 to T4b I Excellent No
294 42 M No Normal or Single Nodular Goiter No Multi-Focal Intermediate to High T3 to T4b I Indeterminate or Incomplete No
268 32 F No Normal or Single Nodular Goiter No Uni-Focal Low T3 to T4b I Excellent No
In [120]:
##################################
# Applying the preprocessing pipeline
# to the validation data
##################################
thyroid_cancer_preprocessed_validation = preprocess_dataset(thyroid_cancer_validation)
X_preprocessed_validation = thyroid_cancer_preprocessed_validation.drop('Recurred', axis = 1)
y_preprocessed_validation = thyroid_cancer_preprocessed_validation['Recurred']
thyroid_cancer_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_PATH, "thyroid_cancer_preprocessed_validation.csv"), index=False)
X_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_FEATURES_PATH, "X_preprocessed_validation.csv"), index=False)
y_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_TARGET_PATH, "y_preprocessed_validation.csv"), index=False)
print('Final Preprocessed Validation Dataset Dimensions: ')
display(X_preprocessed_validation.shape)
display(y_preprocessed_validation.shape)
print('Final Preprocessed Validation Target Variable Breakdown: ')
display(y_preprocessed_validation.value_counts())
print('Final Preprocessed Validation Target Variable Proportion: ')
display(y_preprocessed_validation.value_counts(normalize = True))
thyroid_cancer_preprocessed_validation.head()
Final Preprocessed Validation Dataset Dimensions: 
(69, 10)
(69,)
Final Preprocessed Validation Target Variable Breakdown: 
Recurred
No     49
Yes    20
Name: count, dtype: int64
Final Preprocessed Validation Target Variable Proportion: 
Recurred
No     0.710145
Yes    0.289855
Name: proportion, dtype: float64
Out[120]:
Age Gender Smoking Physical_Examination Adenopathy Focality Risk T Stage Response Recurred
173 30 F No Normal or Single Nodular Goiter No Uni-Focal Low T1 to T2 I Indeterminate or Incomplete No
164 29 F No Normal or Single Nodular Goiter No Multi-Focal Low T1 to T2 I Excellent No
256 21 M Yes Normal or Single Nodular Goiter No Uni-Focal Low T3 to T4b I Indeterminate or Incomplete No
348 58 F No Multinodular or Diffuse Goiter Yes Multi-Focal Intermediate to High T3 to T4b II to IVB Indeterminate or Incomplete Yes
131 31 F No Normal or Single Nodular Goiter No Uni-Focal Low T1 to T2 I Excellent No
In [121]:
##################################
# Applying the preprocessing pipeline
# to the test data
##################################
thyroid_cancer_preprocessed_test = preprocess_dataset(thyroid_cancer_test)
X_preprocessed_test = thyroid_cancer_preprocessed_test.drop('Recurred', axis = 1)
y_preprocessed_test = thyroid_cancer_preprocessed_test['Recurred']
thyroid_cancer_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_PATH, "thyroid_cancer_preprocessed_test.csv"), index=False)
X_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_FEATURES_PATH, "X_preprocessed_test.csv"), index=False)
y_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_TARGET_PATH, "y_preprocessed_test.csv"), index=False)
print('Final Preprocessed Test Dataset Dimensions: ')
display(X_preprocessed_test.shape)
display(y_preprocessed_test.shape)
print('Final Preprocessed Test Target Variable Breakdown: ')
display(y_preprocessed_test.value_counts())
print('Final Preprocessed Test Target Variable Proportion: ')
display(y_preprocessed_test.value_counts(normalize = True))
thyroid_cancer_preprocessed_test.head()
Final Preprocessed Test Dataset Dimensions: 
(91, 10)
(91,)
Final Preprocessed Test Target Variable Breakdown: 
Recurred
No     64
Yes    27
Name: count, dtype: int64
Final Preprocessed Test Target Variable Proportion: 
Recurred
No     0.703297
Yes    0.296703
Name: proportion, dtype: float64
Out[121]:
Age Gender Smoking Physical_Examination Adenopathy Focality Risk T Stage Response Recurred
345 25 F No Multinodular or Diffuse Goiter Yes Multi-Focal Intermediate to High T3 to T4b I Indeterminate or Incomplete Yes
249 46 F No Normal or Single Nodular Goiter No Multi-Focal Low T3 to T4b I Excellent No
83 40 F No Normal or Single Nodular Goiter No Uni-Focal Intermediate to High T1 to T2 I Excellent No
184 67 F No Normal or Single Nodular Goiter No Uni-Focal Low T1 to T2 I Excellent No
146 25 F No Multinodular or Diffuse Goiter No Uni-Focal Low T1 to T2 I Indeterminate or Incomplete No
In [122]:
##################################
# Defining a function to compute
# model performance
##################################
def model_performance_evaluation(y_true, y_pred):
    metric_name = ['Accuracy', 'Precision', 'Recall', 'F1', 'AUROC']
    metric_value = [accuracy_score(y_true, y_pred),
                    precision_score(y_true, y_pred),
                    recall_score(y_true, y_pred),
                    f1_score(y_true, y_pred),
                    roc_auc_score(y_true, y_pred)]
    metric_summary = pd.DataFrame(zip(metric_name, metric_value),
                                  columns=['metric_name', 'metric_value'])
    return metric_summary
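As a quick sanity check, the helper can be exercised on arbitrary made-up encoded labels (0=No, 1=Yes); the values below are illustrative only and not project data:
##################################
# Illustrative call of the evaluation helper
# on arbitrary made-up encoded labels
##################################
import numpy as np
demo_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
demo_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
display(model_performance_evaluation(demo_true, demo_pred))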

1.7. Bagged Model Development ¶

Bagging (Bootstrap Aggregating) is an ensemble learning technique that reduces model variance by training multiple instances of the same algorithm on different randomly sampled subsets of the training data. The fundamental problem bagging aims to solve is overfitting, particularly in high-variance models. By generating multiple bootstrap samples (random subsets created through sampling with replacement), bagging ensures that each model is trained on slightly different data, making the overall prediction more stable. In classification problems, the final output is obtained by majority voting among the individual models, while in regression, their predictions are averaged. Bagging is particularly effective when dealing with noisy datasets, as it smooths out individual model errors. However, its effectiveness is limited for low-variance models, and the requirement to train multiple models increases computational cost.
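To make these mechanics concrete, below is a minimal from-scratch sketch of bagging on synthetic data; all names (X_toy, y_toy, trees) are illustrative assumptions rather than part of this project's pipeline:
##################################
# Minimal from-scratch bagging sketch:
# bootstrap samples drawn with replacement,
# one decision tree per sample,
# majority voting for the final class
##################################
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Each bootstrap sample is the same size as the data, drawn with replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx]))
# Majority vote: a row is classified 1 when most trees predict 1
votes = np.array([tree.predict(X_toy) for tree in trees])
y_bagged = (votes.mean(axis=0) >= 0.5).astype(int)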

1.7.1 Random Forest ¶

Random Forest is an ensemble learning method that builds multiple decision trees and combines their outputs to improve prediction accuracy and robustness in binary classification. Instead of relying on a single decision tree, it aggregates multiple trees, reducing overfitting and increasing generalizability. The algorithm works by training individual decision trees on bootstrapped samples of the dataset, where each tree is trained on a slightly different subset of data. Additionally, at each decision node, a random subset of features is considered for splitting, adding further diversity among the trees. The final classification is determined by majority voting across all trees. The main advantages of Random Forest include its resilience to overfitting, ability to handle high-dimensional data, and robustness against noisy data. However, it has limitations, such as higher computational cost due to multiple trees and reduced interpretability compared to a single decision tree. It can also struggle with highly imbalanced data unless additional techniques like class weighting are applied.
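As a brief illustration of the per-split feature subsampling described above, the following sketch fits a Random Forest on synthetic data with max_features='sqrt'; all names are illustrative, not part of this project's pipeline:
##################################
# Random Forest sketch on synthetic data:
# max_features='sqrt' restricts every split
# to a random subset of the features,
# decorrelating the bootstrapped trees
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=9, random_state=0)
rf_sketch = RandomForestClassifier(n_estimators=100,
                                   max_features='sqrt',  # ~3 of 9 features tried per split
                                   class_weight='balanced',
                                   random_state=0).fit(X_toy, y_toy)
# Majority-vote predictions and impurity-based feature importances
print(rf_sketch.predict(X_toy[:5]))
print(rf_sketch.feature_importances_.round(3))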

  1. The random forest model from the sklearn.ensemble Python library API was implemented.
  2. The model contains 4 hyperparameters for tuning:
    • criterion = function to measure the quality of a split made to vary between gini and entropy
    • max_depth = maximum depth of the tree made to vary between 3 and 6
    • min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
    • n_estimators = number of base estimators in the ensemble made to vary between 100 and 200
  3. A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • criterion = entropy
    • max_depth = 6
    • min_samples_leaf = 10
    • n_estimators = 200
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8921
    • Precision = 0.7746
    • Recall = 0.9016
    • F1 Score = 0.8333
    • AUROC = 0.8948
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8985
    • Precision = 0.7826
    • Recall = 0.9000
    • F1 Score = 0.8372
    • AUROC = 0.8989
  7. Sufficiently comparable apparent and independent validation model performance was observed, which may indicate the absence of excessive model overfitting.
In [123]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [124]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_rf_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('bagged_rf_model', RandomForestClassifier(class_weight='balanced', 
                                               random_state=987654321))
])
In [125]:
##################################
# Defining hyperparameter grid
##################################
bagged_rf_hyperparameter_grid = {
    'bagged_rf_model__criterion': ['gini', 'entropy'],
    'bagged_rf_model__max_depth': [3, 6],
    'bagged_rf_model__min_samples_leaf': [5, 10],
    'bagged_rf_model__n_estimators': [100, 200]
}
In [126]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [127]:
##################################
# Performing Grid Search with cross-validation
##################################
bagged_rf_grid_search = GridSearchCV(
    estimator=bagged_rf_pipeline,
    param_grid=bagged_rf_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [128]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [129]:
##################################
# Fitting GridSearchCV
##################################
bagged_rf_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[129]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('bagged_rf_model',
                                        RandomForestClassifier(class_weight='balanced',
                                                               random_state=987654321))]),
             n_jobs=-1,
             param_grid={'bagged_rf_model__criterion': ['gini', 'entropy'],
                         'bagged_rf_model__max_depth': [3, 6],
                         'bagged_rf_model__min_samples_leaf': [5, 10],
                         'bagged_rf_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
In [130]:
##################################
# Identifying the best model
##################################
bagged_rf_optimal = bagged_rf_grid_search.best_estimator_
In [131]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_rf_optimal_f1_cv = bagged_rf_grid_search.best_score_
bagged_rf_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train))
bagged_rf_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation))
In [132]:
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model - Random Forest: ')
print(f"Best Random Forest Hyperparameters: {bagged_rf_grid_search.best_params_}")
Best Bagged Model - Random Forest: 
Best Random Forest Hyperparameters: {'bagged_rf_model__criterion': 'entropy', 'bagged_rf_model__max_depth': 6, 'bagged_rf_model__min_samples_leaf': 10, 'bagged_rf_model__n_estimators': 200}
In [133]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_rf_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_rf_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8218
F1 Score on Training Data: 0.8333

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.89      0.92       143
         1.0       0.77      0.90      0.83        61

    accuracy                           0.89       204
   macro avg       0.86      0.89      0.88       204
weighted avg       0.90      0.89      0.89       204

In [134]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Random Forest Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Random Forest Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Random Forest model on the train data]
In [135]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_rf_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8372

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.90      0.93        49
         1.0       0.78      0.90      0.84        20

    accuracy                           0.90        69
   macro avg       0.87      0.90      0.88        69
weighted avg       0.91      0.90      0.90        69

In [136]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Random Forest Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Random Forest Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Random Forest model on the validation data]
In [137]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
bagged_rf_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train))
bagged_rf_optimal_train['model'] = ['bagged_rf_optimal'] * 5
bagged_rf_optimal_train['set'] = ['train'] * 5
print('Optimal Random Forest Train Performance Metrics: ')
display(bagged_rf_optimal_train)
Optimal Random Forest Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.892157 bagged_rf_optimal train
1 Precision 0.774648 bagged_rf_optimal train
2 Recall 0.901639 bagged_rf_optimal train
3 F1 0.833333 bagged_rf_optimal train
4 AUROC 0.894876 bagged_rf_optimal train
In [138]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
bagged_rf_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation))
bagged_rf_optimal_validation['model'] = ['bagged_rf_optimal'] * 5
bagged_rf_optimal_validation['set'] = ['validation'] * 5
print('Optimal Random Forest Validation Performance Metrics: ')
display(bagged_rf_optimal_validation)
Optimal Random Forest Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.898551 bagged_rf_optimal validation
1 Precision 0.782609 bagged_rf_optimal validation
2 Recall 0.900000 bagged_rf_optimal validation
3 F1 0.837209 bagged_rf_optimal validation
4 AUROC 0.898980 bagged_rf_optimal validation
In [139]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(bagged_rf_optimal, 
            os.path.join("..", MODELS_PATH, "bagged_model_random_forest_optimal.pkl"))
Out[139]:
['..\\models\\bagged_model_random_forest_optimal.pkl']

1.7.2 Extra Trees ¶

Extra Trees (Extremely Randomized Trees) is a variation of Random Forest that introduces more randomness into tree construction to improve generalization. Like Random Forest, it builds multiple decision trees and aggregates their votes, but it differs in two ways: each tree is typically trained on the full dataset rather than a bootstrap sample, and rather than searching for the best split based on information gain or Gini impurity, Extra Trees draws split thresholds at random for each candidate feature and keeps the best of these random splits. This extra randomness can prevent overfitting and make the model more robust to small variations in data. The key advantages of Extra Trees include its speed, as it does not need to search for the best split threshold at each node, and its ability to handle large datasets efficiently. However, since it relies on random splits, it may not perform as well as Random Forest on some datasets, especially when strong feature interactions exist. Additionally, its randomness can make the model harder to interpret and tune effectively.
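The sketch below, on synthetic data with illustrative names, surfaces the two scikit-learn defaults that distinguish Extra Trees from Random Forest (random split thresholds and bootstrap=False):
##################################
# Extra Trees versus Random Forest sketch:
# Extra Trees draws split thresholds at random
# and by default trains each tree on the full
# dataset (bootstrap=False) rather than on a
# bootstrap sample (bootstrap=True)
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=9, random_state=0)
et_sketch = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)
rf_sketch = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)
# Both aggregate by majority vote; only the split-selection and sampling rules differ
print('Extra Trees bootstrap default:', et_sketch.bootstrap)    # False
print('Random Forest bootstrap default:', rf_sketch.bootstrap)  # True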

  1. The extra trees model from the sklearn.ensemble Python library API was implemented.
  2. The model contains 4 hyperparameters for tuning:
    • criterion = function to measure the quality of a split made to vary between gini and entropy
    • max_depth = maximum depth of the tree made to vary between 3 and 6
    • min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
    • n_estimators = number of base estimators in the ensemble made to vary between 100 and 200
  3. A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • criterion = entropy
    • max_depth = 6
    • min_samples_leaf = 10
    • n_estimators = 200
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8921
    • Precision = 0.7746
    • Recall = 0.9016
    • F1 Score = 0.8333
    • AUROC = 0.8948
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8985
    • Precision = 0.7826
    • Recall = 0.9000
    • F1 Score = 0.8372
    • AUROC = 0.8989
  7. Sufficiently comparable apparent and independent validation model performance was observed, which may indicate the absence of excessive model overfitting.
In [140]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [141]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_et_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('bagged_et_model', ExtraTreesClassifier(class_weight='balanced', 
                                               random_state=987654321))
])
In [142]:
##################################
# Defining hyperparameter grid
##################################
bagged_et_hyperparameter_grid = {
    'bagged_et_model__criterion': ['gini', 'entropy'],
    'bagged_et_model__max_depth': [3, 6],
    'bagged_et_model__min_samples_leaf': [5, 10],
    'bagged_et_model__n_estimators': [100, 200]
}
In [143]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [144]:
##################################
# Performing Grid Search with cross-validation
##################################
bagged_et_grid_search = GridSearchCV(
    estimator=bagged_et_pipeline,
    param_grid=bagged_et_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [145]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [146]:
##################################
# Fitting GridSearchCV
##################################
bagged_et_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[146]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('bagged_et_model',
                                        ExtraTreesClassifier(class_weight='balanced',
                                                             random_state=987654321))]),
             n_jobs=-1,
             param_grid={'bagged_et_model__criterion': ['gini', 'entropy'],
                         'bagged_et_model__max_depth': [3, 6],
                         'bagged_et_model__min_samples_leaf': [5, 10],
                         'bagged_et_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
In [147]:
##################################
# Identifying the best model
##################################
bagged_et_optimal = bagged_et_grid_search.best_estimator_
In [148]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_et_optimal_f1_cv = bagged_et_grid_search.best_score_
bagged_et_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train))
bagged_et_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation))
In [149]:
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model - Extra Trees: ')
print(f"Best Extra Trees Hyperparameters: {bagged_et_grid_search.best_params_}")
Best Bagged Model - Extra Trees: 
Best Extra Trees Hyperparameters: {'bagged_et_model__criterion': 'entropy', 'bagged_et_model__max_depth': 6, 'bagged_et_model__min_samples_leaf': 10, 'bagged_et_model__n_estimators': 200}
In [150]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_et_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_et_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8101
F1 Score on Training Data: 0.8333

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.89      0.92       143
         1.0       0.77      0.90      0.83        61

    accuracy                           0.89       204
   macro avg       0.86      0.89      0.88       204
weighted avg       0.90      0.89      0.89       204

In [151]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Extra Trees Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Extra Trees Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Extra Trees model on the train data]
In [152]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_et_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8372

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.90      0.93        49
         1.0       0.78      0.90      0.84        20

    accuracy                           0.90        69
   macro avg       0.87      0.90      0.88        69
weighted avg       0.91      0.90      0.90        69

In [153]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Extra Trees Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Extra Trees Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Extra Trees model on the validation data]
In [154]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
bagged_et_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train))
bagged_et_optimal_train['model'] = ['bagged_et_optimal'] * 5
bagged_et_optimal_train['set'] = ['train'] * 5
print('Optimal Extra Trees Train Performance Metrics: ')
display(bagged_et_optimal_train)
Optimal Extra Trees Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.892157 bagged_et_optimal train
1 Precision 0.774648 bagged_et_optimal train
2 Recall 0.901639 bagged_et_optimal train
3 F1 0.833333 bagged_et_optimal train
4 AUROC 0.894876 bagged_et_optimal train
In [155]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
bagged_et_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation))
bagged_et_optimal_validation['model'] = ['bagged_et_optimal'] * 5
bagged_et_optimal_validation['set'] = ['validation'] * 5
print('Optimal Extra Trees Validation Performance Metrics: ')
display(bagged_et_optimal_validation)
Optimal Extra Trees Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.898551 bagged_et_optimal validation
1 Precision 0.782609 bagged_et_optimal validation
2 Recall 0.900000 bagged_et_optimal validation
3 F1 0.837209 bagged_et_optimal validation
4 AUROC 0.898980 bagged_et_optimal validation
In [156]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(bagged_et_optimal, 
            os.path.join("..", MODELS_PATH, "bagged_model_extra_trees_optimal.pkl"))
Out[156]:
['..\\models\\bagged_model_extra_trees_optimal.pkl']

1.7.3 Bagged Decision Trees ¶

Bagged Decision Trees is an ensemble method that reduces overfitting by training multiple decision trees on different bootstrap samples and aggregating their predictions. Unlike Random Forest, it considers all features when searching for the best split at each node, making it less random but still improving stability compared to a single decision tree. The process involves drawing multiple random subsets of the training data (with replacement), training a decision tree on each subset, and combining the predictions using majority voting for classification. This technique helps to reduce variance and prevent overfitting, leading to more stable and accurate predictions. The main advantage of Bagged Decision Trees is that they perform well on complex datasets without requiring deep tuning. However, the downside is that they require significant computational power and memory, as multiple trees must be trained and stored. Additionally, unlike boosting methods, bagging does not inherently reduce bias, meaning the performance is still dependent on the base decision tree's predictive power.
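As a side note, because each bootstrap sample omits roughly a third of the rows on average, BaggingClassifier can produce a built-in out-of-bag accuracy estimate; a minimal sketch on synthetic data (illustrative names, separate from this project's pipeline) follows:
##################################
# Bagged decision trees with out-of-bag scoring:
# rows left out of a tree's bootstrap sample
# (about 37% on average) act as a built-in
# validation set, summarized by oob_score_
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=9, random_state=0)
bagged_sketch = BaggingClassifier(estimator=DecisionTreeClassifier(),  # all features considered per split
                                  n_estimators=100,
                                  oob_score=True,
                                  random_state=0).fit(X_toy, y_toy)
print('Out-of-bag accuracy estimate:', round(bagged_sketch.oob_score_, 4))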

  1. The bagging classifier and decision tree models from the sklearn.ensemble and sklearn.tree Python library APIs were implemented.
  2. The model contains 4 hyperparameters for tuning:
    • criterion = function to measure the quality of a split made to vary between gini and entropy
    • max_depth = maximum depth of the tree made to vary between 3 and 6
    • min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
    • n_estimators = number of base estimators in the ensemble made to vary between 100 and 200
  3. A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • criterion = gini
    • max_depth = 6
    • min_samples_leaf = 5
    • n_estimators = 200
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9019
    • Precision = 0.7971
    • Recall = 0.9016
    • F1 Score = 0.8461
    • AUROC = 0.9018
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9130
    • Precision = 0.8181
    • Recall = 0.9000
    • F1 Score = 0.8571
    • AUROC = 0.9091
  7. Sufficiently comparable apparent and independent validation model performance was observed, which may indicate the absence of excessive model overfitting.
In [157]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [158]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_bdt_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('bagged_bdt_model', BaggingClassifier(estimator=DecisionTreeClassifier(class_weight='balanced', 
                                                                            random_state=987654321),
                                           random_state=987654321))
])
In [159]:
##################################
# Defining hyperparameter grid
##################################
bagged_bdt_hyperparameter_grid = {
    'bagged_bdt_model__estimator__criterion': ['gini', 'entropy'],
    'bagged_bdt_model__estimator__max_depth': [3, 6],
    'bagged_bdt_model__estimator__min_samples_leaf': [5, 10],
    'bagged_bdt_model__n_estimators': [100, 200]
}
In [160]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [161]:
##################################
# Performing Grid Search with cross-validation
##################################
bagged_bdt_grid_search = GridSearchCV(
    estimator=bagged_bdt_pipeline,
    param_grid=bagged_bdt_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [162]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [163]:
##################################
# Fitting GridSearchCV
##################################
bagged_bdt_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[163]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])...
                                        BaggingClassifier(estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                                           random_state=987654321),
                                                          random_state=987654321))]),
             n_jobs=-1,
             param_grid={'bagged_bdt_model__estimator__criterion': ['gini',
                                                                    'entropy'],
                         'bagged_bdt_model__estimator__max_depth': [3, 6],
                         'bagged_bdt_model__estimator__min_samples_leaf': [5,
                                                                           10],
                         'bagged_bdt_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
In [164]:
##################################
# Identifying the best model
##################################
bagged_bdt_optimal = bagged_bdt_grid_search.best_estimator_
In [165]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_bdt_optimal_f1_cv = bagged_bdt_grid_search.best_score_
bagged_bdt_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train))
bagged_bdt_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation))
In [166]:
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model – Bagged Decision Trees: ')
print(f"Best Bagged Decision Trees Hyperparameters: {bagged_bdt_grid_search.best_params_}")
Best Bagged Model – Bagged Decision Trees: 
Best Bagged Decision Trees Hyperparameters: {'bagged_bdt_model__estimator__criterion': 'gini', 'bagged_bdt_model__estimator__max_depth': 6, 'bagged_bdt_model__estimator__min_samples_leaf': 5, 'bagged_bdt_model__n_estimators': 200}
In [167]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_bdt_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_bdt_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8287
F1 Score on Training Data: 0.8462

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.90      0.93       143
         1.0       0.80      0.90      0.85        61

    accuracy                           0.90       204
   macro avg       0.88      0.90      0.89       204
weighted avg       0.91      0.90      0.90       204

In [168]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Decision Trees Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Decision Trees Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices, Optimal Bagged Decision Trees Train Performance]
In [169]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_bdt_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69

In [170]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Decision Trees Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Decision Trees Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices, Optimal Bagged Decision Trees Validation Performance]
In [171]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
bagged_bdt_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train))
bagged_bdt_optimal_train['model'] = ['bagged_bdt_optimal'] * 5
bagged_bdt_optimal_train['set'] = ['train'] * 5
print('Optimal Bagged Decision Trees Train Performance Metrics: ')
display(bagged_bdt_optimal_train)
Optimal Bagged Decision Trees Train Performance Metrics: 
  metric_name  metric_value               model    set
0    Accuracy      0.901961  bagged_bdt_optimal  train
1   Precision      0.797101  bagged_bdt_optimal  train
2      Recall      0.901639  bagged_bdt_optimal  train
3          F1      0.846154  bagged_bdt_optimal  train
4       AUROC      0.901869  bagged_bdt_optimal  train
In [172]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
bagged_bdt_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation))
bagged_bdt_optimal_validation['model'] = ['bagged_bdt_optimal'] * 5
bagged_bdt_optimal_validation['set'] = ['validation'] * 5
print('Optimal Bagged Decision Trees Validation Performance Metrics: ')
display(bagged_bdt_optimal_validation)
Optimal Bagged Decision Trees Validation Performance Metrics: 
  metric_name  metric_value               model         set
0    Accuracy      0.913043  bagged_bdt_optimal  validation
1   Precision      0.818182  bagged_bdt_optimal  validation
2      Recall      0.900000  bagged_bdt_optimal  validation
3          F1      0.857143  bagged_bdt_optimal  validation
4       AUROC      0.909184  bagged_bdt_optimal  validation
In [173]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(bagged_bdt_optimal, 
            os.path.join("..", MODELS_PATH, "bagged_model_bagged_decision_trees_optimal.pkl"))
Out[173]:
['..\\models\\bagged_model_bagged_decision_trees_optimal.pkl']

1.7.4 Bagged Logistic Regression ¶

Bagged Logistic Regression applies bootstrap aggregation (bagging) to logistic regression, improving its stability and generalization. Logistic regression is inherently a high-bias model, meaning it can underperform on complex, non-linear data. Bagging helps by training multiple logistic regression models on different bootstrap samples and averaging their probability outputs for final classification. This reduces variance and improves robustness, especially when dealing with small datasets prone to fluctuations. The main advantage is that it stabilizes logistic regression by reducing overfitting without adding significant complexity. Additionally, it works well when the relationship between features and the target variable is approximately linear. However, since logistic regression is a weak learner, bagging does not dramatically boost performance on highly non-linear problems. It is also computationally expensive compared to a single logistic regression model, and unlike boosting, it does not correct the inherent bias of logistic regression.
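As a minimal sketch of the soft-voting mechanics that BaggingClassifier automates in the cells below, the snippet here (using synthetic stand-in data, not the project dataset) fits one logistic regression per bootstrap sample and averages the predicted probabilities:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=987654321)

rng = np.random.default_rng(987654321)
member_probabilities = []
for _ in range(50):
    # Draw a bootstrap sample by resampling rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    member = LogisticRegression(class_weight='balanced', max_iter=1000)
    member.fit(X[idx], y[idx])
    member_probabilities.append(member.predict_proba(X)[:, 1])

# Soft voting: average the per-member probabilities, then threshold at 0.5
bagged_probability = np.mean(member_probabilities, axis=0)
bagged_prediction = (bagged_probability >= 0.5).astype(int)

The tuned implementation below delegates exactly this resampling and averaging to BaggingClassifier so that it can be cross-validated inside the existing preprocessing pipeline.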

  1. The bagging classifier and logistic regression models from the sklearn.ensemble and sklearn.linear_model Python library APIs were implemented.
  2. The model contains 4 hyperparameters for tuning:
    • C = inverse of regularization strength made to vary between 0.1 and 1.0
    • penalty = penalty norm made to vary between l1 and l2
    • solver = algorithm used in the optimization problem made to vary between liblinear and saga
    • n_estimators = number of base estimators in the ensemble made to vary between 100 and 200
  3. A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories (see the weighting sketch after this list).
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • C = 1.0
    • penalty = l1
    • solver = liblinear
    • n_estimators = 200
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8921
    • Precision = 0.7746
    • Recall = 0.9016
    • F1 Score = 0.8333
    • AUROC = 0.8948
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8985
    • Precision = 0.7826
    • Recall = 0.9000
    • F1 Score = 0.8372
    • AUROC = 0.8989
  7. The apparent and independent validation model performance was sufficiently comparable, which might be indicative of the absence of excessive model overfitting.
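As referenced in item 3 above, the class_weight = balanced setting reweights each class inversely to its frequency. A minimal sketch of sklearn's underlying formula, applied to an illustrative 2:1 label vector rather than the project data, is shown here:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative 2:1 imbalanced label vector (not the project data)
y_example = np.array([0] * 140 + [1] * 70)

# The 'balanced' heuristic: n_samples / (n_classes * per-class count)
manual_weights = len(y_example) / (2 * np.bincount(y_example))
sklearn_weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_example)

print(manual_weights)   # [0.75 1.5 ] -> the minority class is weighted twice as heavily
print(sklearn_weights)  # matches the manual computation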
In [174]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [175]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_blr_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('bagged_blr_model', BaggingClassifier(estimator=LogisticRegression(class_weight='balanced', 
                                                                        random_state=987654321),
                                           random_state=987654321))
])
In [176]:
##################################
# Defining hyperparameter grid
##################################
bagged_blr_hyperparameter_grid = {
    'bagged_blr_model__estimator__C': [0.1, 1.0],
    'bagged_blr_model__estimator__penalty': ['l1', 'l2'],
    'bagged_blr_model__estimator__solver': ['liblinear', 'saga'],
    'bagged_blr_model__n_estimators': [100, 200]
}
In [177]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [178]:
##################################
# Performing Grid Search with cross-validation
##################################
bagged_blr_grid_search = GridSearchCV(
    estimator=bagged_blr_pipeline,
    param_grid=bagged_blr_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [179]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [180]:
##################################
# Fitting GridSearchCV
##################################
bagged_blr_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[180]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])...
                                        BaggingClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                                                       random_state=987654321),
                                                          random_state=987654321))]),
             n_jobs=-1,
             param_grid={'bagged_blr_model__estimator__C': [0.1, 1.0],
                         'bagged_blr_model__estimator__penalty': ['l1', 'l2'],
                         'bagged_blr_model__estimator__solver': ['liblinear',
                                                                 'saga'],
                         'bagged_blr_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
In [181]:
##################################
# Identifying the best model
##################################
bagged_blr_optimal = bagged_blr_grid_search.best_estimator_
In [182]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_blr_optimal_f1_cv = bagged_blr_grid_search.best_score_
bagged_blr_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train))
bagged_blr_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation))
In [183]:
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model – Bagged Logistic Regression: ')
print(f"Best Bagged Logistic Regression Hyperparameters: {bagged_blr_grid_search.best_params_}")
Best Bagged Model – Bagged Logistic Regression: 
Best Bagged Logistic Regression Hyperparameters: {'bagged_blr_model__estimator__C': 1.0, 'bagged_blr_model__estimator__penalty': 'l1', 'bagged_blr_model__estimator__solver': 'liblinear', 'bagged_blr_model__n_estimators': 200}
In [184]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_blr_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_blr_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8213
F1 Score on Training Data: 0.8333

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.89      0.92       143
         1.0       0.77      0.90      0.83        61

    accuracy                           0.89       204
   macro avg       0.86      0.89      0.88       204
weighted avg       0.90      0.89      0.89       204

In [185]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Logistic Regression Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Logistic Regression Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices, Optimal Bagged Logistic Regression Train Performance]
In [186]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_blr_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8372

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.90      0.93        49
         1.0       0.78      0.90      0.84        20

    accuracy                           0.90        69
   macro avg       0.87      0.90      0.88        69
weighted avg       0.91      0.90      0.90        69

In [187]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Logistic Regression Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Logistic Regression Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices, Optimal Bagged Logistic Regression Validation Performance]
In [188]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
bagged_blr_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train))
bagged_blr_optimal_train['model'] = ['bagged_blr_optimal'] * 5
bagged_blr_optimal_train['set'] = ['train'] * 5
print('Optimal Bagged Logistic Regression Train Performance Metrics: ')
display(bagged_blr_optimal_train)
Optimal Bagged Logistic Regression Train Performance Metrics: 
  metric_name  metric_value               model    set
0    Accuracy      0.892157  bagged_blr_optimal  train
1   Precision      0.774648  bagged_blr_optimal  train
2      Recall      0.901639  bagged_blr_optimal  train
3          F1      0.833333  bagged_blr_optimal  train
4       AUROC      0.894876  bagged_blr_optimal  train
In [189]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
bagged_blr_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation))
bagged_blr_optimal_validation['model'] = ['bagged_blr_optimal'] * 5
bagged_blr_optimal_validation['set'] = ['validation'] * 5
print('Optimal Bagged Logistic Regression Validation Performance Metrics: ')
display(bagged_blr_optimal_validation)
Optimal Bagged Logistic Regression Validation Performance Metrics: 
  metric_name  metric_value               model         set
0    Accuracy      0.898551  bagged_blr_optimal  validation
1   Precision      0.782609  bagged_blr_optimal  validation
2      Recall      0.900000  bagged_blr_optimal  validation
3          F1      0.837209  bagged_blr_optimal  validation
4       AUROC      0.898980  bagged_blr_optimal  validation
In [190]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(bagged_blr_optimal, 
            os.path.join("..", MODELS_PATH, "bagged_model_bagged_logistic_regression_optimal.pkl"))
Out[190]:
['..\\models\\bagged_model_bagged_logistic_regression_optimal.pkl']

1.7.5 Bagged Support Vector Machine ¶

Bagged Support Vector Machine is an ensemble method that applies bagging to multiple SVM classifiers trained on different bootstrap samples, reducing variance while maintaining SVM's strong classification capabilities. SVM works by finding an optimal decision boundary (hyperplane) that maximizes the margin between different classes. However, a single SVM can be sensitive to small changes in data, especially when working with noisy datasets. By training multiple SVM models on different subsets and aggregating their predictions (majority voting), bagging stabilizes the decision boundary and enhances robustness. This approach is particularly useful when dealing with high-dimensional datasets with complex relationships. The key advantages include improved generalization, reduced overfitting, and better handling of noisy data. However, SVM is computationally intensive, and bagging increases the overall training time significantly, especially for large datasets. Additionally, combining multiple SVM models makes interpretation difficult, and performance gains may not always justify the added computational cost.
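Since SVC does not expose predicted probabilities unless probability=True is set, bagged SVMs are aggregated by majority voting. A minimal sketch of that mechanism (again on synthetic stand-in data, not the project dataset) is shown below:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=987654321)

rng = np.random.default_rng(987654321)
votes = []
for _ in range(25):
    # Train each SVM member on its own bootstrap sample
    idx = rng.integers(0, len(X), size=len(X))
    member = SVC(kernel='linear', class_weight='balanced')
    member.fit(X[idx], y[idx])
    votes.append(member.predict(X))

# Majority voting: the class chosen by most members wins each sample
vote_matrix = np.stack(votes)
bagged_prediction = (vote_matrix.mean(axis=0) >= 0.5).astype(int)

BaggingClassifier performs an analogous vote automatically whenever its base estimator lacks a predict_proba method, which is the case for the SVC configuration tuned below.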

  1. The bagging classifier and support vector machine models from the sklearn.ensemble and sklearn.svm Python library APIs were implemented.
  2. The model contains 4 hyperparameters for tuning:
    • C = regularization parameter (inversely proportional to the regularization strength) made to vary between 0.1 and 1.0
    • kernel = kernel type to be used in the algorithm made to vary between linear and rbf
    • gamma = kernel coefficient made to vary between scale and auto
    • n_estimators = number of base estimators in the ensemble made to vary between 100 and 200
  3. A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • C = 1.0
    • kernel = linear
    • gamma = scale
    • n_estimators = 100
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9068
    • Precision = 0.8088
    • Recall = 0.9016
    • F1 Score = 0.8527
    • AUROC = 0.9053
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9130
    • Precision = 0.8181
    • Recall = 0.9000
    • F1 Score = 0.8571
    • AUROC = 0.9091
  7. The apparent and independent validation model performance was sufficiently comparable, which might be indicative of the absence of excessive model overfitting.
In [191]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [192]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_bsvm_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('bagged_bsvm_model', BaggingClassifier(estimator=SVC(class_weight='balanced', 
                                                          random_state=987654321),
                                            random_state=987654321))
])
In [193]:
##################################
# Defining hyperparameter grid
##################################
bagged_bsvm_hyperparameter_grid = {
    'bagged_bsvm_model__estimator__C': [0.1, 1.0],
    'bagged_bsvm_model__estimator__kernel': ['linear', 'rbf'],
    'bagged_bsvm_model__estimator__gamma': ['scale','auto'],
    'bagged_bsvm_model__n_estimators': [100, 200]
}
In [194]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [195]:
##################################
# Performing Grid Search with cross-validation
##################################
bagged_bsvm_grid_search = GridSearchCV(
    estimator=bagged_bsvm_pipeline,
    param_grid=bagged_bsvm_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [196]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [197]:
##################################
# Fitting GridSearchCV
##################################
bagged_bsvm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[197]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])...
                                        BaggingClassifier(estimator=SVC(class_weight='balanced',
                                                                        random_state=987654321),
                                                          random_state=987654321))]),
             n_jobs=-1,
             param_grid={'bagged_bsvm_model__estimator__C': [0.1, 1.0],
                         'bagged_bsvm_model__estimator__gamma': ['scale',
                                                                 'auto'],
                         'bagged_bsvm_model__estimator__kernel': ['linear',
                                                                  'rbf'],
                         'bagged_bsvm_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
In [198]:
##################################
# Identifying the best model
##################################
bagged_bsvm_optimal = bagged_bsvm_grid_search.best_estimator_
In [199]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_bsvm_optimal_f1_cv = bagged_bsvm_grid_search.best_score_
bagged_bsvm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train))
bagged_bsvm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation))
In [200]:
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model – Bagged Support Vector Machine: ')
print(f"Best Bagged Support Vector Machine Hyperparameters: {bagged_bsvm_grid_search.best_params_}")
Best Bagged Model – Bagged Support Vector Machine: 
Best Bagged Support Vector Machine Hyperparameters: {'bagged_bsvm_model__estimator__C': 1.0, 'bagged_bsvm_model__estimator__gamma': 'scale', 'bagged_bsvm_model__estimator__kernel': 'linear', 'bagged_bsvm_model__n_estimators': 100}
In [201]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_bsvm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_bsvm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8209
F1 Score on Training Data: 0.8527

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.91      0.93       143
         1.0       0.81      0.90      0.85        61

    accuracy                           0.91       204
   macro avg       0.88      0.91      0.89       204
weighted avg       0.91      0.91      0.91       204

In [202]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Support Vector Machine Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Support Vector Machine Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices, Optimal Bagged Support Vector Machine Train Performance]
In [203]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_bsvm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69

In [204]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Support Vector Machine Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Support Vector Machine Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices, Optimal Bagged Support Vector Machine Validation Performance]
In [205]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
bagged_bsvm_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train))
bagged_bsvm_optimal_train['model'] = ['bagged_bsvm_optimal'] * 5
bagged_bsvm_optimal_train['set'] = ['train'] * 5
print('Optimal Bagged Support Vector Machine Train Performance Metrics: ')
display(bagged_bsvm_optimal_train)
Optimal Bagged Support Vector Machine Train Performance Metrics: 
  metric_name  metric_value                model    set
0    Accuracy      0.906863  bagged_bsvm_optimal  train
1   Precision      0.808824  bagged_bsvm_optimal  train
2      Recall      0.901639  bagged_bsvm_optimal  train
3          F1      0.852713  bagged_bsvm_optimal  train
4       AUROC      0.905365  bagged_bsvm_optimal  train
In [206]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
bagged_bsvm_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation))
bagged_bsvm_optimal_validation['model'] = ['bagged_bsvm_optimal'] * 5
bagged_bsvm_optimal_validation['set'] = ['validation'] * 5
print('Optimal Bagged Support Vector Machine Validation Performance Metrics: ')
display(bagged_bsvm_optimal_validation)
Optimal Bagged Support Vector Machine Validation Performance Metrics: 
  metric_name  metric_value                model         set
0    Accuracy      0.913043  bagged_bsvm_optimal  validation
1   Precision      0.818182  bagged_bsvm_optimal  validation
2      Recall      0.900000  bagged_bsvm_optimal  validation
3          F1      0.857143  bagged_bsvm_optimal  validation
4       AUROC      0.909184  bagged_bsvm_optimal  validation
In [207]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(bagged_bsvm_optimal, 
            os.path.join("..", MODELS_PATH, "bagged_model_bagged_svm_optimal.pkl"))
Out[207]:
['..\\models\\bagged_model_bagged_svm_optimal.pkl']

1.8 Boosted Model Development ¶

Boosting is an ensemble learning method that builds a strong classifier by training models sequentially, where each new model focuses on correcting the mistakes of its predecessors. Boosting assigns higher weights to misclassified instances, ensuring that subsequent models pay more attention to these hard-to-classify cases. The motivation behind boosting is to reduce both bias and variance by iteratively refining weak learners — models that perform only slightly better than random guessing — until they collectively form a strong classifier. In classification tasks, predictions are refined by combining weighted outputs of multiple weak models, typically decision stumps or shallow trees. This makes boosting highly effective in uncovering complex patterns in data. However, the sequential nature of boosting makes it computationally expensive compared to bagging, and it is more prone to overfitting if the number of weak learners is too high.
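Because each stage is intended to correct its predecessors, the sequential refinement can be observed directly. The minimal sketch below (on synthetic stand-in data, with illustrative names like booster and X_val rather than the project objects) uses the staged_predict generator exposed by sklearn boosting estimators to track validation F1 as members are added:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=400, n_features=8, random_state=987654321)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=987654321)

booster = AdaBoostClassifier(n_estimators=50, random_state=987654321).fit(X_train, y_train)

# staged_predict yields the running ensemble's prediction after 1, 2, ..., n stages
for stage, y_pred in enumerate(booster.staged_predict(X_val), start=1):
    if stage % 10 == 0:
        print(f"Stage {stage:2d}: validation F1 = {f1_score(y_val, y_pred):.4f}")

A flattening or declining curve indicates that additional weak learners no longer help, which is one way to gauge when boosting starts to overfit.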

1.8.1 AdaBoost ¶

AdaBoost (Adaptive Boosting) is a boosting technique that combines multiple weak learners — typically decision stumps (shallow trees) — to form a strong classifier. It works by iteratively training weak models, assigning higher weights to misclassified instances so that subsequent models focus on difficult cases. At each iteration, a new weak model is trained, and its predictions are combined using a weighted voting mechanism. This process continues until a stopping criterion is met, such as a predefined number of iterations or performance threshold. AdaBoost is advantageous because it improves accuracy without overfitting if regularized properly. It performs well with clean data and can transform weak classifiers into strong ones. However, it is sensitive to noisy data and outliers, as misclassified points receive higher importance, leading to potential overfitting. Additionally, training can be slow for large datasets, and performance depends on the choice of base learner, typically decision trees.
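The reweighting loop at the heart of AdaBoost can be written out directly. The sketch below is a simplified discrete-AdaBoost illustration on synthetic stand-in data (not the tuned pipeline developed in this section), showing how misclassified samples gain weight at every round:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with labels recoded to {-1, +1} (illustration only)
X, y = make_classification(n_samples=300, n_features=8, random_state=987654321)
y_signed = np.where(y == 1, 1, -1)

n_samples = len(X)
weights = np.full(n_samples, 1.0 / n_samples)   # start from uniform sample weights
stumps, alphas = [], []

for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1, random_state=987654321)
    stump.fit(X, y_signed, sample_weight=weights)
    prediction = stump.predict(X)

    # Weighted error rate and the stump's voting weight (alpha)
    error = np.sum(weights * (prediction != y_signed))
    alpha = 0.5 * np.log((1 - error) / max(error, 1e-10))

    # Misclassified samples gain weight; correctly classified samples lose weight
    weights *= np.exp(-alpha * y_signed * prediction)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final classification: sign of the alpha-weighted vote over all stumps
vote = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_prediction = np.sign(vote).astype(int)

The AdaBoostClassifier tuned below applies the same reweighting principle, with learning_rate scaling each stump's contribution to the weighted vote.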

  1. The AdaBoost model from the sklearn.ensemble Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • estimator_max_depth = maximum depth of the tree made to vary between 1 and 2
    • learning_rate = weight applied to each classifier at each boosting iteration made to vary between 0.01 and 0.10
    • n_estimators = maximum number of estimators at which boosting is terminated made to vary between 50 and 100
  3. No hyperparameter was defined in the model to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • estimator_max_depth = 2
    • learning_rate = 0.01
    • n_estimators = 50
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9019
    • Precision = 0.8059
    • Recall = 0.8852
    • F1 Score = 0.8437
    • AUROC = 0.8971
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9130
    • Precision = 0.8181
    • Recall = 0.9000
    • F1 Score = 0.8571
    • AUROC = 0.9091
  7. The apparent and independent validation model performance was sufficiently comparable, which might be indicative of the absence of excessive model overfitting.
In [208]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [209]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_ab_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('boosted_ab_model', AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=987654321),
                                            random_state=987654321))
])
In [210]:
##################################
# Defining hyperparameter grid
##################################
boosted_ab_hyperparameter_grid = {
    'boosted_ab_model__learning_rate': [0.01, 0.10],  
    'boosted_ab_model__estimator__max_depth': [1, 2],
    'boosted_ab_model__n_estimators': [50, 100] 
}
In [211]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [212]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_ab_grid_search = GridSearchCV(
    estimator=boosted_ab_pipeline,
    param_grid=boosted_ab_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [213]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [214]:
##################################
# Fitting GridSearchCV
##################################
boosted_ab_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[214]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('boosted_ab_model',
                                        AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=987654321),
                                                           random_state=987654321))]),
             n_jobs=-1,
             param_grid={'boosted_ab_model__estimator__max_depth': [1, 2],
                         'boosted_ab_model__learning_rate': [0.01, 0.1],
                         'boosted_ab_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
In [215]:
##################################
# Identifying the best model
##################################
boosted_ab_optimal = boosted_ab_grid_search.best_estimator_
In [216]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_ab_optimal_f1_cv = boosted_ab_grid_search.best_score_
boosted_ab_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train))
boosted_ab_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation))
In [217]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - AdaBoost: ')
print(f"Best AdaBoost Hyperparameters: {boosted_ab_grid_search.best_params_}")
Best Boosted Model - AdaBoost: 
Best AdaBoost Hyperparameters: {'boosted_ab_model__estimator__max_depth': 2, 'boosted_ab_model__learning_rate': 0.01, 'boosted_ab_model__n_estimators': 50}
In [218]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_ab_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_ab_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8364
F1 Score on Training Data: 0.8438

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.91      0.93       143
         1.0       0.81      0.89      0.84        61

    accuracy                           0.90       204
   macro avg       0.88      0.90      0.89       204
weighted avg       0.91      0.90      0.90       204

In [219]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal AdaBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal AdaBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal AdaBoost train performance]
In [220]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_ab_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69

In [221]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal AdaBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal AdaBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal AdaBoost validation performance]
In [222]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_ab_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train))
boosted_ab_optimal_train['model'] = ['boosted_ab_optimal'] * 5
boosted_ab_optimal_train['set'] = ['train'] * 5
print('Optimal AdaBoost Train Performance Metrics: ')
display(boosted_ab_optimal_train)
Optimal AdaBoost Train Performance Metrics: 
  metric_name  metric_value               model    set
0    Accuracy      0.901961  boosted_ab_optimal  train
1   Precision      0.805970  boosted_ab_optimal  train
2      Recall      0.885246  boosted_ab_optimal  train
3          F1      0.843750  boosted_ab_optimal  train
4       AUROC      0.897168  boosted_ab_optimal  train
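For context, model_performance_evaluation is a helper defined earlier in the notebook; a minimal sketch consistent with the long-format tables shown here (assumed for illustration, not the notebook's exact definition):

##################################
# Assumed shape of the metric-gathering helper:
# hard predictions are scored on five metrics and
# returned as a long-format DataFrame
##################################
import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def model_performance_evaluation_sketch(y_true, y_pred):
    metrics = {'Accuracy': accuracy_score(y_true, y_pred),
               'Precision': precision_score(y_true, y_pred),
               'Recall': recall_score(y_true, y_pred),
               'F1': f1_score(y_true, y_pred),
               'AUROC': roc_auc_score(y_true, y_pred)}
    return pd.DataFrame({'metric_name': list(metrics.keys()),
                         'metric_value': list(metrics.values())})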
In [223]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_ab_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation))
boosted_ab_optimal_validation['model'] = ['boosted_ab_optimal'] * 5
boosted_ab_optimal_validation['set'] = ['validation'] * 5
print('Optimal AdaBoost Validation Performance Metrics: ')
display(boosted_ab_optimal_validation)
Optimal AdaBoost Validation Performance Metrics: 
  metric_name  metric_value               model         set
0    Accuracy      0.913043  boosted_ab_optimal  validation
1   Precision      0.818182  boosted_ab_optimal  validation
2      Recall      0.900000  boosted_ab_optimal  validation
3          F1      0.857143  boosted_ab_optimal  validation
4       AUROC      0.909184  boosted_ab_optimal  validation
In [224]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_ab_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_adaboost_optimal.pkl"))
Out[224]:
['..\\models\\boosted_model_adaboost_optimal.pkl']
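Once persisted, the pipeline can be restored for later inference without refitting; a minimal sketch mirroring the save call above (the reloaded variable names are hypothetical):

##################################
# Reloading the persisted AdaBoost pipeline
# for downstream inference (illustrative only)
##################################
import os
import joblib
boosted_ab_reloaded = joblib.load(os.path.join("..", MODELS_PATH, "boosted_model_adaboost_optimal.pkl"))
reloaded_predictions = boosted_ab_reloaded.predict(X_preprocessed_validation)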

1.8.2 Gradient Boosting ¶

Gradient Boosting builds an ensemble of decision trees sequentially, where each new tree corrects the mistakes of the previous ones by optimizing a loss function. Unlike AdaBoost, which reweights misclassified instances, Gradient Boosting fits each new tree to the residual errors of the previous model, gradually improving predictions. This process continues until a stopping criterion, such as a set number of trees, is met. The key advantages of Gradient Boosting include its flexibility to model complex relationships and strong predictive performance, often outperforming bagging methods. It can handle both numeric and categorical data well. However, it is prone to overfitting if not carefully tuned, especially with deep trees and too many iterations. It is also computationally expensive due to sequential training, and hyperparameter tuning (e.g., learning rate, number of trees, tree depth) can be challenging and time-consuming.
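Before tuning the full pipeline, the residual-fitting mechanism described above can be made concrete with a toy regression sketch (illustrative only; the variable names are hypothetical and squared-error loss is assumed, for which the negative gradient equals the plain residual):

##################################
# Minimal sketch of the residual-fitting idea behind
# gradient boosting: each shallow tree is fit to the
# residuals of the current ensemble prediction, then
# added with shrinkage (the learning rate)
##################################
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(987654321)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.2, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # initial constant model
for _ in range(50):
    residuals = y_toy - prediction  # negative gradient of squared-error loss
    stage_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * stage_tree.predict(X_toy)  # shrunken additive update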

  1. The gradient boosting model from the sklearn.ensemble Python library API was implemented.
  2. The model contains 4 hyperparameters for tuning:
    • learning_rate = shrinking proportion of the contribution from each tree made to vary between 0.01 and 0.10
    • max_depth = maximum depth of the tree made to vary between 3 and 6
    • min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
    • n_estimators = number of boosting stages to perform made to vary between 50 and 100
  3. No hyperparameter was defined in the model to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • learning_rate = 0.10
    • max_depth = 3
    • min_samples_leaf = 10
    • n_estimators = 50
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9460
    • Precision = 0.9032
    • Recall = 0.9180
    • F1 Score = 0.9105
    • AUROC = 0.9380
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8985
    • Precision = 0.8095
    • Recall = 0.8500
    • F1 Score = 0.8292
    • AUROC = 0.8841
  7. A relatively large difference between the apparent and independent validation model performance was observed, which might be indicative of moderate model overfitting.
In [225]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [226]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_gb_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('boosted_gb_model', GradientBoostingClassifier(random_state=987654321))
])
In [227]:
##################################
# Defining hyperparameter grid
##################################
boosted_gb_hyperparameter_grid = {
    'boosted_gb_model__learning_rate': [0.01, 0.10],
    'boosted_gb_model__max_depth': [3, 6], 
    'boosted_gb_model__min_samples_leaf': [5, 10],
    'boosted_gb_model__n_estimators': [50, 100] 
}
In [228]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
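With n_splits=5 and n_repeats=5, each candidate configuration is scored on 5 × 5 = 25 resampled folds, so the 16-candidate grid below accounts for the 16 × 25 = 400 fits reported by GridSearchCV.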
In [229]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_gb_grid_search = GridSearchCV(
    estimator=boosted_gb_pipeline,
    param_grid=boosted_gb_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [230]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [231]:
##################################
# Fitting GridSearchCV
##################################
boosted_gb_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[231]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('boosted_gb_model',
                                        GradientBoostingClassifier(random_state=987654321))]),
             n_jobs=-1,
             param_grid={'boosted_gb_model__learning_rate': [0.01, 0.1],
                         'boosted_gb_model__max_depth': [3, 6],
                         'boosted_gb_model__min_samples_leaf': [5, 10],
                         'boosted_gb_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
In [232]:
##################################
# Identifying the best model
##################################
boosted_gb_optimal = boosted_gb_grid_search.best_estimator_
In [233]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_gb_optimal_f1_cv = boosted_gb_grid_search.best_score_
boosted_gb_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train))
boosted_gb_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation))
In [234]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - Gradient Boosting: ')
print(f"Best Gradient Boosting Hyperparameters: {boosted_gb_grid_search.best_params_}")
Best Boosted Model - Gradient Boosting: 
Best Gradient Boosting Hyperparameters: {'boosted_gb_model__learning_rate': 0.1, 'boosted_gb_model__max_depth': 3, 'boosted_gb_model__min_samples_leaf': 10, 'boosted_gb_model__n_estimators': 50}
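Beyond the single winning configuration, the closeness of competing candidates can be inspected through the grid search's standard cv_results_ attribute; a brief sketch:

##################################
# Ranking all tuned candidates by mean cross-validated F1
# to gauge how decisive the selected configuration was
##################################
import pandas as pd
gb_cv_results = pd.DataFrame(boosted_gb_grid_search.cv_results_)
display(gb_cv_results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
        .sort_values('rank_test_score')
        .head())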
In [235]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_gb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_gb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8131
F1 Score on Training Data: 0.9106

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.96      0.96       143
         1.0       0.90      0.92      0.91        61

    accuracy                           0.95       204
   macro avg       0.93      0.94      0.94       204
weighted avg       0.95      0.95      0.95       204

In [236]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Gradient Boosting Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Gradient Boosting Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Gradient Boosting train performance]
In [237]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_gb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8293

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.92      0.93        49
         1.0       0.81      0.85      0.83        20

    accuracy                           0.90        69
   macro avg       0.87      0.88      0.88        69
weighted avg       0.90      0.90      0.90        69

In [238]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Gradient Boosting Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Gradient Boosting Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Gradient Boosting validation performance]
In [239]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_gb_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train))
boosted_gb_optimal_train['model'] = ['boosted_gb_optimal'] * 5
boosted_gb_optimal_train['set'] = ['train'] * 5
print('Optimal Gradient Boosting Train Performance Metrics: ')
display(boosted_gb_optimal_train)
Optimal Gradient Boosting Train Performance Metrics: 
  metric_name  metric_value               model    set
0    Accuracy      0.946078  boosted_gb_optimal  train
1   Precision      0.903226  boosted_gb_optimal  train
2      Recall      0.918033  boosted_gb_optimal  train
3          F1      0.910569  boosted_gb_optimal  train
4       AUROC      0.938037  boosted_gb_optimal  train
In [240]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_gb_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation))
boosted_gb_optimal_validation['model'] = ['boosted_gb_optimal'] * 5
boosted_gb_optimal_validation['set'] = ['validation'] * 5
print('Optimal Gradient Boosting Validation Performance Metrics: ')
display(boosted_gb_optimal_validation)
Optimal Gradient Boosting Validation Performance Metrics: 
  metric_name  metric_value               model         set
0    Accuracy      0.898551  boosted_gb_optimal  validation
1   Precision      0.809524  boosted_gb_optimal  validation
2      Recall      0.850000  boosted_gb_optimal  validation
3          F1      0.829268  boosted_gb_optimal  validation
4       AUROC      0.884184  boosted_gb_optimal  validation
In [241]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_gb_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_gradient_boosting_optimal.pkl"))
Out[241]:
['..\\models\\boosted_model_gradient_boosting_optimal.pkl']

1.8.3 XGBoost ¶

XGBoost (Extreme Gradient Boosting) is an optimized version of Gradient Boosting that introduces additional regularization and computational efficiencies. It builds decision trees sequentially, with each new tree correcting the residual errors of the previous ones, but it incorporates advanced techniques such as shrinkage (learning rate), column subsampling, and L1/L2 regularization to prevent overfitting. Additionally, XGBoost employs parallelization, reducing training time significantly compared to standard Gradient Boosting. It is widely used in machine learning competitions due to its superior accuracy and efficiency. The key advantages include its ability to handle missing data, built-in regularization for better generalization, and fast training through parallelization. However, XGBoost requires careful hyperparameter tuning to achieve optimal performance, and the model can become overly complex, making interpretation difficult. It is also memory-intensive, especially for large datasets, and can be challenging to deploy efficiently in real-time applications.

  1. The xgboost model from the xgboost Python library API was implemented.
  2. The model contains 4 hyperparameters for tuning:
    • learning_rate = step size at which weights are updated during training made to vary between 0.01 and 0.10
    • max_depth = maximum depth of the tree made to vary between 3 and 6
    • gamma = minimum loss reduction required to make a further split in a tree made to vary between 0.10 and 0.20
    • n_estimators = number of boosting stages to perform made to vary between 50 and 100
  3. A special hyperparameter (scale_pos_weight = 2.0) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories (the sketch after this list shows how this value follows from the class counts).
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • learning_rate = 0.01
    • max_depth = 3
    • gamma = 0.10
    • n_estimators = 50
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9068
    • Precision = 0.8181
    • Recall = 0.8852
    • F1 Score = 0.8503
    • AUROC = 0.9006
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9130
    • Precision = 0.8181
    • Recall = 0.9000
    • F1 Score = 0.8571
    • AUROC = 0.9091
  7. Sufficiently comparable apparent and independent validation model performance was observed, which might be indicative of the absence of excessive model overfitting.
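As referenced in item 3 above, the fixed scale_pos_weight can be derived from the training class counts; a minimal sketch, assuming y_preprocessed_train_encoded codes recurrence (Yes) as 1:

##################################
# Deriving scale_pos_weight from the class counts:
# XGBoost's convention is (number of negatives) / (number of positives);
# with 143 No and 61 Yes cases in the training split this gives
# 143 / 61 ≈ 2.34, consistent with the fixed value of 2.0
##################################
import numpy as np
n_negative = int(np.sum(y_preprocessed_train_encoded == 0))
n_positive = int(np.sum(y_preprocessed_train_encoded == 1))
print(f"scale_pos_weight ≈ {n_negative / n_positive:.2f}")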
In [242]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [243]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_xgb_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('boosted_xgb_model', XGBClassifier(scale_pos_weight=2.0, 
                                        random_state=987654321,
                                        subsample=0.7,
                                        colsample_bytree=0.7,
                                        eval_metric='logloss'))
])
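Note that the fixed subsample=0.7 and colsample_bytree=0.7 settings implement the row and column subsampling mentioned in the description above, injecting stochasticity into each boosting round that typically improves generalization.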
In [244]:
##################################
# Defining hyperparameter grid
##################################
boosted_xgb_hyperparameter_grid = {
    'boosted_xgb_model__learning_rate': [0.01, 0.10],
    'boosted_xgb_model__max_depth': [3, 6], 
    'boosted_xgb_model__gamma': [0.1, 0.2],
    'boosted_xgb_model__n_estimators': [50, 100] 
}
In [245]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [246]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_xgb_grid_search = GridSearchCV(
    estimator=boosted_xgb_pipeline,
    param_grid=boosted_xgb_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [247]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [248]:
##################################
# Fitting GridSearchCV
##################################
boosted_xgb_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[248]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])...
                                                      missing=nan,
                                                      monotone_constraints=None,
                                                      multi_strategy=None,
                                                      n_estimators=None,
                                                      n_jobs=None,
                                                      num_parallel_tree=None,
                                                      random_state=987654321, ...))]),
             n_jobs=-1,
             param_grid={'boosted_xgb_model__gamma': [0.1, 0.2],
                         'boosted_xgb_model__learning_rate': [0.01, 0.1],
                         'boosted_xgb_model__max_depth': [3, 6],
                         'boosted_xgb_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
In [249]:
##################################
# Identifying the best model
##################################
boosted_xgb_optimal = boosted_xgb_grid_search.best_estimator_
In [250]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_xgb_optimal_f1_cv = boosted_xgb_grid_search.best_score_
boosted_xgb_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train))
boosted_xgb_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation))
In [251]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - XGBoost: ')
print(f"Best XGBoost Hyperparameters: {boosted_xgb_grid_search.best_params_}")
Best Boosted Model - XGBoost: 
Best XGBoost Hyperparameters: {'boosted_xgb_model__gamma': 0.1, 'boosted_xgb_model__learning_rate': 0.01, 'boosted_xgb_model__max_depth': 3, 'boosted_xgb_model__n_estimators': 50}
In [252]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_xgb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_xgb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8322
F1 Score on Training Data: 0.8504

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.92      0.93       143
         1.0       0.82      0.89      0.85        61

    accuracy                           0.91       204
   macro avg       0.88      0.90      0.89       204
weighted avg       0.91      0.91      0.91       204

In [253]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal XGBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal XGBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal XGBoost train performance]
In [254]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_xgb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69

In [255]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal XGBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal XGBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal XGBoost validation performance]
In [256]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_xgb_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train))
boosted_xgb_optimal_train['model'] = ['boosted_xgb_optimal'] * 5
boosted_xgb_optimal_train['set'] = ['train'] * 5
print('Optimal XGBoost Train Performance Metrics: ')
display(boosted_xgb_optimal_train)
Optimal XGBoost Train Performance Metrics: 
  metric_name  metric_value                model    set
0    Accuracy      0.906863  boosted_xgb_optimal  train
1   Precision      0.818182  boosted_xgb_optimal  train
2      Recall      0.885246  boosted_xgb_optimal  train
3          F1      0.850394  boosted_xgb_optimal  train
4       AUROC      0.900665  boosted_xgb_optimal  train
In [257]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_xgb_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation))
boosted_xgb_optimal_validation['model'] = ['boosted_xgb_optimal'] * 5
boosted_xgb_optimal_validation['set'] = ['validation'] * 5
print('Optimal XGBoost Validation Performance Metrics: ')
display(boosted_xgb_optimal_validation)
Optimal XGBoost Validation Performance Metrics: 
  metric_name  metric_value                model         set
0    Accuracy      0.913043  boosted_xgb_optimal  validation
1   Precision      0.818182  boosted_xgb_optimal  validation
2      Recall      0.900000  boosted_xgb_optimal  validation
3          F1      0.857143  boosted_xgb_optimal  validation
4       AUROC      0.909184  boosted_xgb_optimal  validation
In [258]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_xgb_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_xgboost_optimal.pkl"))
Out[258]:
['..\\models\\boosted_model_xgboost_optimal.pkl']

1.8.4 Light GBM ¶

Light GBM (Light Gradient Boosting Machine) is a variation of Gradient Boosting designed for efficiency and scalability. Unlike traditional boosting methods that grow trees level by level, LightGBM grows trees leaf-wise, choosing the most informative splits, leading to faster convergence. It also uses histogram-based binning to speed up computations. These optimizations allow LightGBM to train on large datasets efficiently while maintaining high accuracy. Its advantages include faster training speed, reduced memory usage, and strong predictive performance, particularly for large datasets with many features. However, LightGBM can overfit more easily than XGBoost if not properly tuned, and it may not perform as well on small datasets. Additionally, its handling of categorical variables requires careful preprocessing, and the leaf-wise tree growth can sometimes lead to instability if not controlled properly.
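The leaf-wise growth described above is controlled chiefly through num_leaves rather than tree depth; a minimal sketch on synthetic data (illustrative only; make_classification and the toy variable names are assumptions), mirroring the num_leaves, min_child_samples, and max_depth=-1 settings used in this section:

##################################
# Minimal sketch of leaf-wise complexity control in LightGBM:
# num_leaves bounds the tree size directly, while max_depth=-1
# leaves the depth unconstrained
##################################
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=987654321)
lgbm_toy = LGBMClassifier(num_leaves=16, max_depth=-1, min_child_samples=6,
                          learning_rate=0.01, n_estimators=100, verbose=-1)
lgbm_toy.fit(X_toy, y_toy)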

  1. The light gbm model from the lightgbm Python library API was implemented.
  2. The model contains 4 hyperparameters for tuning:
    • learning_rate = step size at which weights are updated during training made to vary between 0.01 and 0.10
    • min_child_samples = minimum number of data points needed in a child (leaf) made to vary between 3 and 6
    • num_leaves = maximum tree leaves for base learners made to vary between 8 and 16
    • n_estimators = number of boosted trees to fit made to vary between 50 and 100
  3. A special hyperparameter (scale_pos_weight = 2.0) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • learning_rate = 0.01
    • min_child_samples = 6
    • num_leaves = 16
    • n_estimators = 100
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9362
    • Precision = 0.8870
    • Recall = 0.9016
    • F1 Score = 0.8943
    • AUROC = 0.9263
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8985
    • Precision = 0.8421
    • Recall = 0.8000
    • F1 Score = 0.8205
    • AUROC = 0.8693
  7. A relatively large difference between the apparent and independent validation model performance was observed, which might be indicative of moderate model overfitting.
In [259]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [260]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_lgbm_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('boosted_lgbm_model', LGBMClassifier(scale_pos_weight=2.0, 
                                          random_state=987654321,
                                          max_depth=-1,
                                          feature_fraction =0.7,
                                          bagging_fraction=0.7,
                                          verbose=-1))
])
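Here, feature_fraction and bagging_fraction play the same roles as XGBoost's colsample_bytree and subsample; note that, per the LightGBM documentation, bagging_fraction only takes effect when bagging_freq is also set to a non-zero value, so the row subsampling configured here may be inactive.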
In [261]:
##################################
# Defining hyperparameter grid
##################################
boosted_lgbm_hyperparameter_grid = {
    'boosted_lgbm_model__learning_rate': [0.01, 0.10],
    'boosted_lgbm_model__min_child_samples': [3, 6], 
    'boosted_lgbm_model__num_leaves': [8, 16],
    'boosted_lgbm_model__n_estimators': [50, 100] 
}
In [262]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [263]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_lgbm_grid_search = GridSearchCV(
    estimator=boosted_lgbm_pipeline,
    param_grid=boosted_lgbm_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [264]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [265]:
##################################
# Fitting GridSearchCV
##################################
boosted_lgbm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[265]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])...
                                       ('boosted_lgbm_model',
                                        LGBMClassifier(bagging_fraction=0.7,
                                                       feature_fraction=0.7,
                                                       random_state=987654321,
                                                       scale_pos_weight=2.0,
                                                       verbose=-1))]),
             n_jobs=-1,
             param_grid={'boosted_lgbm_model__learning_rate': [0.01, 0.1],
                         'boosted_lgbm_model__min_child_samples': [3, 6],
                         'boosted_lgbm_model__n_estimators': [50, 100],
                         'boosted_lgbm_model__num_leaves': [8, 16]},
             scoring='f1', verbose=1)
In [266]:
##################################
# Identifying the best model
##################################
boosted_lgbm_optimal = boosted_lgbm_grid_search.best_estimator_
In [267]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='sklearn.utils.validation')
boosted_lgbm_optimal_f1_cv = boosted_lgbm_grid_search.best_score_
boosted_lgbm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train))
boosted_lgbm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation))
In [268]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - Light GBM: ')
print(f"Best Light GBM Hyperparameters: {boosted_lgbm_grid_search.best_params_}")
Best Boosted Model - Light GBM: 
Best Light GBM Hyperparameters: {'boosted_lgbm_model__learning_rate': 0.01, 'boosted_lgbm_model__min_child_samples': 6, 'boosted_lgbm_model__n_estimators': 100, 'boosted_lgbm_model__num_leaves': 16}
In [269]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_lgbm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_lgbm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8182
F1 Score on Training Data: 0.8943

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.95      0.95       143
         1.0       0.89      0.90      0.89        61

    accuracy                           0.94       204
   macro avg       0.92      0.93      0.92       204
weighted avg       0.94      0.94      0.94       204

In [270]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Light GBM Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Light GBM Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Light GBM train performance]
In [271]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_lgbm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8205

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.92      0.94      0.93        49
         1.0       0.84      0.80      0.82        20

    accuracy                           0.90        69
   macro avg       0.88      0.87      0.87        69
weighted avg       0.90      0.90      0.90        69

In [272]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Light GBM Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Light GBM Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Light GBM validation performance]
In [273]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_lgbm_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train))
boosted_lgbm_optimal_train['model'] = ['boosted_lgbm_optimal'] * 5
boosted_lgbm_optimal_train['set'] = ['train'] * 5
print('Optimal Light GBM Train Performance Metrics: ')
display(boosted_lgbm_optimal_train)
Optimal Light GBM Train Performance Metrics: 
| metric_name | metric_value | model | set |
|---|---|---|---|
| Accuracy | 0.936275 | boosted_lgbm_optimal | train |
| Precision | 0.887097 | boosted_lgbm_optimal | train |
| Recall | 0.901639 | boosted_lgbm_optimal | train |
| F1 | 0.894309 | boosted_lgbm_optimal | train |
| AUROC | 0.926344 | boosted_lgbm_optimal | train |
In [274]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_lgbm_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation))
boosted_lgbm_optimal_validation['model'] = ['boosted_lgbm_optimal'] * 5
boosted_lgbm_optimal_validation['set'] = ['validation'] * 5
print('Optimal Light GBM Validation Performance Metrics: ')
display(boosted_lgbm_optimal_validation)
Optimal Light GBM Validation Performance Metrics: 
| metric_name | metric_value | model | set |
|---|---|---|---|
| Accuracy | 0.898551 | boosted_lgbm_optimal | validation |
| Precision | 0.842105 | boosted_lgbm_optimal | validation |
| Recall | 0.800000 | boosted_lgbm_optimal | validation |
| F1 | 0.820513 | boosted_lgbm_optimal | validation |
| AUROC | 0.869388 | boosted_lgbm_optimal | validation |
In [275]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_lgbm_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_light_gbm_optimal.pkl"))
Out[275]:
['..\\models\\boosted_model_light_gbm_optimal.pkl']

1.8.5 CatBoost ¶

CatBoost (Categorical Boosting) is a boosting algorithm optimized for categorical data. Unlike other gradient boosting methods that require categorical variables to be manually encoded, CatBoost handles them natively, reducing preprocessing effort and improving performance. It builds decision trees iteratively, like other boosting methods, but uses ordered boosting to prevent target leakage and enhance generalization. The main advantages of CatBoost are its ability to handle categorical data without extensive preprocessing, high accuracy with minimal tuning, and robustness against overfitting due to built-in regularization. Additionally, it is relatively fast and memory-efficient. However, CatBoost can still be slower than LightGBM on very large datasets, and while it requires less tuning, improper parameter selection can lead to suboptimal performance. Its internal mechanics, such as ordered boosting, make interpretation more complex compared to simpler models.
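To make the native categorical handling concrete, the minimal sketch below passes a raw string column directly to CatBoost through the cat_features argument of fit, with no manual encoding. The toy frame, column values, and settings are illustrative assumptions and are independent of the tuned pipeline implemented later in this section, which (for consistency with the other boosted models) ordinal-encodes the categorical features before fitting.

##################################
# Illustrative sketch (not a project cell):
# CatBoost consuming a raw categorical column natively
##################################
import pandas as pd
from catboost import CatBoostClassifier

# Hypothetical toy data for illustration only
X_toy = pd.DataFrame({
    'Risk': ['Low', 'High', 'Low', 'Intermediate', 'High', 'Low'],  # raw strings
    'Age': [35, 62, 41, 55, 70, 29]
})
y_toy = [0, 1, 0, 1, 1, 0]

# cat_features marks which columns CatBoost should treat as categorical natively
toy_model = CatBoostClassifier(iterations=50, learning_rate=0.1, verbose=False)
toy_model.fit(X_toy, y_toy, cat_features=['Risk'])
print(toy_model.predict(X_toy))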

  1. The catboost model from the catboost Python library API was implemented.
  2. The model contains 4 hyperparameters for tuning:
    • learning_rate = step size at which weights are updated during training made to vary between 0.01 and 0.10
    • max_depth = maximum depth of each decision tree in the boosting process made to vary between 3 and 6
    • num_leaves = maximum tree leaves for base learners made to vary between 8 and 16
    • iterations = number of boosted trees to fit made to vary between 50 and 100
  3. A special hyperparameter (scale_pos_weight = 2.0) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories (see the brief sketch after this list).
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • learning_rate = 0.01
    • max_depth = 3
    • num_leaves = 8
    • iterations = 50
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9019
    • Precision = 0.8059
    • Recall = 0.8852
    • F1 Score = 0.8437
    • AUROC = 0.8971
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9130
    • Precision = 0.8181
    • Recall = 0.9000
    • F1 Score = 0.8571
    • AUROC = 0.9091
  7. The apparent and independent validation model performance were sufficiently comparable, which might be indicative of the absence of excessive model overfitting.
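As referenced in item 3 above, the fixed scale_pos_weight value can be derived directly from the observed class ratio. The sketch below is illustrative only: the 143:61 class counts are taken from the train-data classification report in this section, and the variable names are assumptions.

##################################
# Illustrative sketch (not a project cell):
# deriving scale_pos_weight from the class ratio
##################################
import numpy as np

y_labels = np.array([0] * 143 + [1] * 61)  # assumed counts from the train split
ratio = (y_labels == 0).sum() / (y_labels == 1).sum()
print(round(ratio, 2))  # ~2.34, approximated by the fixed scale_pos_weight = 2.0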
In [276]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [277]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_cb_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    # note: verbose is left at its default here, so CatBoost prints a
    # per-iteration training log when the final model is refit below
    ('boosted_cb_model', CatBoostClassifier(scale_pos_weight=2.0,
                                            random_state=987654321,
                                            subsample=0.7,
                                            colsample_bylevel=0.7,
                                            grow_policy='Lossguide'))
])
In [278]:
##################################
# Defining hyperparameter grid
##################################
boosted_cb_hyperparameter_grid = {
    'boosted_cb_model__learning_rate': [0.01, 0.10],
    'boosted_cb_model__max_depth': [3, 6], 
    'boosted_cb_model__num_leaves': [8, 16],
    'boosted_cb_model__iterations': [50, 100] 
}
In [279]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [280]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_cb_grid_search = GridSearchCV(
    estimator=boosted_cb_pipeline,
    param_grid=boosted_cb_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [281]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [282]:
##################################
# Fitting GridSearchCV
##################################
boosted_cb_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
0:	learn: 0.6891722	total: 142ms	remaining: 6.93s
1:	learn: 0.6834783	total: 143ms	remaining: 3.43s
[... per-iteration CatBoost training log truncated: iterations 2-47 elided ...]
48:	learn: 0.5132195	total: 193ms	remaining: 3.93ms
49:	learn: 0.5104675	total: 193ms	remaining: 0us
Out[282]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('boosted_cb_model',
                                        <catboost.core.CatBoostClassifier object at 0x000002AD5551D3D0>)]),
             n_jobs=-1,
             param_grid={'boosted_cb_model__iterations': [50, 100],
                         'boosted_cb_model__learning_rate': [0.01, 0.1],
                         'boosted_cb_model__max_depth': [3, 6],
                         'boosted_cb_model__num_leaves': [8, 16]},
             scoring='f1', verbose=1)
In [283]:
##################################
# Identifying the best model
##################################
boosted_cb_optimal = boosted_cb_grid_search.best_estimator_
In [284]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_cb_optimal_f1_cv = boosted_cb_grid_search.best_score_
boosted_cb_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train))
boosted_cb_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation))
In [285]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - CatBoost: ')
print(f"Best CatBoost Hyperparameters: {boosted_cb_grid_search.best_params_}")
Best Boosted Model - CatBoost: 
Best CatBoost Hyperparameters: {'boosted_cb_model__iterations': 50, 'boosted_cb_model__learning_rate': 0.01, 'boosted_cb_model__max_depth': 3, 'boosted_cb_model__num_leaves': 8}
In [286]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_cb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_cb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8259
F1 Score on Training Data: 0.8438

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.91      0.93       143
         1.0       0.81      0.89      0.84        61

    accuracy                           0.90       204
   macro avg       0.88      0.90      0.89       204
weighted avg       0.91      0.90      0.90       204

In [287]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal CatBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal CatBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal CatBoost train performance]
In [288]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_cb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69

In [289]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal CatBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal CatBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal CatBoost validation performance]
In [290]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_cb_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train))
boosted_cb_optimal_train['model'] = ['boosted_cb_optimal'] * 5
boosted_cb_optimal_train['set'] = ['train'] * 5
print('Optimal CatBoost Train Performance Metrics: ')
display(boosted_cb_optimal_train)
Optimal CatBoost Train Performance Metrics: 
| metric_name | metric_value | model | set |
|---|---|---|---|
| Accuracy | 0.901961 | boosted_cb_optimal | train |
| Precision | 0.805970 | boosted_cb_optimal | train |
| Recall | 0.885246 | boosted_cb_optimal | train |
| F1 | 0.843750 | boosted_cb_optimal | train |
| AUROC | 0.897168 | boosted_cb_optimal | train |
In [291]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_cb_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation))
boosted_cb_optimal_validation['model'] = ['boosted_cb_optimal'] * 5
boosted_cb_optimal_validation['set'] = ['validation'] * 5
print('Optimal CatBoost Validation Performance Metrics: ')
display(boosted_cb_optimal_validation)
Optimal CatBoost Validation Performance Metrics: 
| metric_name | metric_value | model | set |
|---|---|---|---|
| Accuracy | 0.913043 | boosted_cb_optimal | validation |
| Precision | 0.818182 | boosted_cb_optimal | validation |
| Recall | 0.900000 | boosted_cb_optimal | validation |
| F1 | 0.857143 | boosted_cb_optimal | validation |
| AUROC | 0.909184 | boosted_cb_optimal | validation |
In [292]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_cb_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_catboost_optimal.pkl"))
Out[292]:
['..\\models\\boosted_model_catboost_optimal.pkl']

1.9 Stacked Model Development ¶

Stacking, or stacked generalization, is an advanced ensemble method that improves predictive performance by training a meta-model to learn the optimal way to combine multiple base models using their out-of-fold predictions. Unlike traditional ensemble techniques such as bagging and boosting, which aggregate predictions through simple rules like averaging or majority voting, stacking introduces a second-level model that intelligently learns how to integrate diverse base models. The process starts by training multiple classifiers on the training dataset. However, instead of directly using their predictions, stacking employs k-fold cross-validation to generate out-of-fold predictions. Specifically, each base model is trained on a subset of the training data while leaving out a validation fold, and predictions on that unseen fold are recorded. This process is repeated across all folds, ensuring that each instance in the training data receives predictions from models that never saw it during training. These out-of-fold predictions are then used as input features for a meta-model, which learns the best way to combine them into a final decision. The advantage of stacking is that it allows different models to complement each other, capturing diverse aspects of the data that a single model might miss. This often results in superior classification accuracy compared to individual models or simpler ensemble approaches. However, stacking is computationally expensive, requiring multiple training iterations for base models and the additional meta-model. It also demands careful tuning to prevent overfitting, as the meta-model’s complexity can introduce new sources of error. Despite these challenges, stacking remains a powerful technique in applications where maximizing predictive performance is a priority.
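The mechanics described above can be sketched with scikit-learn's StackingClassifier, whose cv argument generates the out-of-fold predictions internally before the meta-model is trained on them. The snippet below uses synthetic data and a reduced, untuned learner set purely for illustration; the project's actual preprocessing, hyperparameter tuning, and full set of base learners follow in the subsections below.

##################################
# Illustrative sketch (not a project cell):
# stacking with out-of-fold predictions via StackingClassifier
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical synthetic data for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# cv=5 produces out-of-fold predictions from each base learner;
# the Logistic Regression meta-learner is then trained on those predictions
stack = StackingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier(n_neighbors=3)),
        ('svm', SVC(kernel='linear', random_state=0)),
        ('ridge', RidgeClassifier())
    ],
    final_estimator=LogisticRegression(),
    cv=5
)
stack.fit(X_toy, y_toy)
print(stack.score(X_toy, y_toy))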

1.9.1 Base Learner - K-Nearest Neighbors ¶

K-Nearest Neighbors (KNN) is a non-parametric classification algorithm that makes predictions based on the majority class among the k-nearest training samples in feature space. It does not create an explicit model during training; instead, it stores the entire dataset and computes distances between a query point and all training samples during inference. The algorithm follows three key steps: (1) compute the distance between the query point and all training samples (typically using Euclidean distance), (2) identify the k closest points, and (3) assign the most common class among them as the predicted label. KNN is advantageous because it is simple, requires minimal training time, and can model complex decision boundaries when provided with sufficient data. However, it has significant drawbacks: it is computationally expensive for large datasets since distances must be computed for every prediction, it is sensitive to irrelevant or redundant features, and it requires careful selection of k, as a small k can make the model too sensitive to noise while a large k can overly smooth decision boundaries.
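The three steps enumerated above translate directly into a few lines of NumPy. The sketch below, including its toy arrays and the knn_predict helper, is purely illustrative and is not part of the project pipeline, which uses the scikit-learn implementation shown later in this subsection.

##################################
# Illustrative sketch (not a project cell):
# the three KNN prediction steps in plain NumPy
##################################
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    # (1) compute Euclidean distances from the query to all training samples
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # (2) identify the k closest points
    nearest = np.argsort(distances)[:k]
    # (3) assign the most common class among those k neighbors
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Hypothetical toy data for illustration only
X_train_toy = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_train_toy, y_train_toy, np.array([1.2, 1.9]), k=3))  # -> 0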

  1. The k-nearest neighbors model from the sklearn.neighbors Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • n_neighbors = number of neighbors to use made to vary between 3 and 5
    • weights = weight function used in prediction made to vary between uniform and distance
    • metric = metric to use for distance computation made to vary between minkowski and euclidean (with the default p = 2, minkowski is equivalent to euclidean, so these two settings coincide here)
  3. No special hyperparameter was defined in the model to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • n_neighbors = 3
    • weights = uniform
    • metric = minkowski
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9215
    • Precision = 0.9090
    • Recall = 0.8196
    • F1 Score = 0.8620
    • AUROC = 0.8923
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8115
    • Precision = 0.7058
    • Recall = 0.6000
    • F1 Score = 0.6486
    • AUROC = 0.7489
  7. A relatively large difference between the apparent and independent validation model performance was observed, which might be indicative of moderate model overfitting.
In [293]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
    remainder='passthrough',
    force_int_remainder_cols=False)
In [294]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_knn_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('stacked_baselearner_knn_model', KNeighborsClassifier())
])
In [295]:
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_knn_hyperparameter_grid = {
    'stacked_baselearner_knn_model__n_neighbors': [3, 5],
    'stacked_baselearner_knn_model__weights': ['uniform', 'distance'],
    'stacked_baselearner_knn_model__metric': ['minkowski', 'euclidean']
}
In [296]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [297]:
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_knn_grid_search = GridSearchCV(
    estimator=stacked_baselearner_knn_pipeline,
    param_grid=stacked_baselearner_knn_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [298]:
##################################
# Encoding the response variables
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [299]:
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_knn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[299]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_knn_model',
                                        KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_knn_model__metric': ['minkowski',
                                                                   'euclidean'],
                         'stacked_baselearner_knn_model__n_neighbors': [3, 5],
                         'stacked_baselearner_knn_model__weights': ['uniform',
                                                                    'distance']},
             scoring='f1', verbose=1)
In [300]:
##################################
# Identifying the best model
##################################
stacked_baselearner_knn_optimal = stacked_baselearner_knn_grid_search.best_estimator_
In [301]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_knn_optimal_f1_cv = stacked_baselearner_knn_grid_search.best_score_
stacked_baselearner_knn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train))
stacked_baselearner_knn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation))
In [302]:
##################################
# Identifying the optimal model
##################################
print('Best Stacked Base Learner KNN: ')
print(f"Best Stacked Base Learner KNN Hyperparameters: {stacked_baselearner_knn_grid_search.best_params_}")
Best Stacked Base Learner KNN: 
Best Stacked Base Learner KNN Hyperparameters: {'stacked_baselearner_knn_model__metric': 'minkowski', 'stacked_baselearner_knn_model__n_neighbors': 3, 'stacked_baselearner_knn_model__weights': 'uniform'}
In [303]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_knn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_knn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.6417
F1 Score on Training Data: 0.8621

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.93      0.97      0.95       143
         1.0       0.91      0.82      0.86        61

    accuracy                           0.92       204
   macro avg       0.92      0.89      0.90       204
weighted avg       0.92      0.92      0.92       204

In [304]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner KNN Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner KNN Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Stacked Base Learner KNN train performance]
In [305]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_knn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.6486

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.85      0.90      0.87        49
         1.0       0.71      0.60      0.65        20

    accuracy                           0.81        69
   macro avg       0.78      0.75      0.76        69
weighted avg       0.81      0.81      0.81        69

In [306]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner KNN Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner KNN Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Stacked Base Learner KNN validation performance]
In [307]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_baselearner_knn_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train))
stacked_baselearner_knn_optimal_train['model'] = ['stacked_baselearner_knn_optimal'] * 5
stacked_baselearner_knn_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Base Learner KNN Train Performance Metrics: ')
display(stacked_baselearner_knn_optimal_train)
Optimal Stacked Base Learner KNN Train Performance Metrics: 
| metric_name | metric_value | model | set |
|---|---|---|---|
| Accuracy | 0.921569 | stacked_baselearner_knn_optimal | train |
| Precision | 0.909091 | stacked_baselearner_knn_optimal | train |
| Recall | 0.819672 | stacked_baselearner_knn_optimal | train |
| F1 | 0.862069 | stacked_baselearner_knn_optimal | train |
| AUROC | 0.892354 | stacked_baselearner_knn_optimal | train |
In [308]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_baselearner_knn_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation))
stacked_baselearner_knn_optimal_validation['model'] = ['stacked_baselearner_knn_optimal'] * 5
stacked_baselearner_knn_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Base Learner KNN Validation Performance Metrics: ')
display(stacked_baselearner_knn_optimal_validation)
Optimal Stacked Base Learner KNN Validation Performance Metrics: 
| metric_name | metric_value | model | set |
|---|---|---|---|
| Accuracy | 0.811594 | stacked_baselearner_knn_optimal | validation |
| Precision | 0.705882 | stacked_baselearner_knn_optimal | validation |
| Recall | 0.600000 | stacked_baselearner_knn_optimal | validation |
| F1 | 0.648649 | stacked_baselearner_knn_optimal | validation |
| AUROC | 0.748980 | stacked_baselearner_knn_optimal | validation |
In [309]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(stacked_baselearner_knn_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_baselearner_knn_optimal.pkl"))
Out[309]:
['..\\models\\stacked_model_baselearner_knn_optimal.pkl']

1.9.2 Base Learner - Support Vector Machine ¶

Support Vector Machine (SVM) is a powerful classification algorithm that finds an optimal decision boundary — called a hyperplane — that maximizes the margin between two classes. The algorithm works by identifying the most influential data points, known as support vectors, that define this boundary. If the data is not linearly separable, SVM can use kernel functions to map it into a higher-dimensional space where separation is possible. The main advantages of SVM include strong theoretical guarantees, effectiveness in high-dimensional spaces, and robustness against overfitting when properly regularized. It performs well when the margin between classes is clear and works effectively with small to medium-sized datasets. However, SVM has notable limitations: it is computationally expensive, making it impractical for very large datasets; it requires careful tuning of hyperparameters such as the kernel type and regularization strength; and it is not easily interpretable, as decision boundaries in high-dimensional space can be difficult to visualize.
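As a minimal illustration of the margin mechanics described above, the sketch below fits a linear SVC on separable synthetic blobs and inspects the support vectors that define the maximum-margin hyperplane. The data and most settings are assumptions for demonstration only, although class_weight='balanced' mirrors the fixed hyperparameter used later in this subsection.

##################################
# Illustrative sketch (not a project cell):
# a linear SVM and its boundary-defining support vectors
##################################
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical separable toy data for illustration only
X_toy, y_toy = make_blobs(n_samples=60, centers=2, random_state=0)

toy_svm = SVC(kernel='linear', C=1.0, class_weight='balanced', random_state=0)
toy_svm.fit(X_toy, y_toy)

# only the boundary-defining points are retained as support vectors
print('Support vectors per class:', toy_svm.n_support_)
print('Hyperplane coefficients:', toy_svm.coef_, 'intercept:', toy_svm.intercept_)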

  1. The support vector machine model from the sklearn.svm Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • C = inverse of regularization strength made to vary between 0.1 and 1.0
    • kernel = kernel type to be used in the algorithm made to vary between linear and rbf
    • gamma = kernel coefficient made to vary between scale and auto (relevant only to the rbf kernel; it is ignored when kernel = linear)
  3. A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • C = 1.0
    • kernel = linear
    • gamma = scale
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9019
    • Precision = 0.8059
    • Recall = 0.8852
    • F1 Score = 0.8437
    • AUROC = 0.8971
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9130
    • Precision = 0.8181
    • Recall = 0.9000
    • F1 Score = 0.8571
    • AUROC = 0.9091
  7. The apparent and independent validation model performance were sufficiently comparable, which might be indicative of the absence of excessive model overfitting.
In [310]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [311]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_svm_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('stacked_baselearner_svm_model', SVC(class_weight='balanced',
                                          random_state=987654321))
])
In [312]:
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_svm_hyperparameter_grid = {
    'stacked_baselearner_svm_model__C': [0.1, 1.0],
    'stacked_baselearner_svm_model__kernel': ['linear', 'rbf'],
    'stacked_baselearner_svm_model__gamma': ['scale','auto']
}
In [313]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [314]:
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_svm_grid_search = GridSearchCV(
    estimator=stacked_baselearner_svm_pipeline,
    param_grid=stacked_baselearner_svm_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [315]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [316]:
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_svm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[316]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_svm_model',
                                        SVC(class_weight='balanced',
                                            random_state=987654321))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_svm_model__C': [0.1, 1.0],
                         'stacked_baselearner_svm_model__gamma': ['scale',
                                                                  'auto'],
                         'stacked_baselearner_svm_model__kernel': ['linear',
                                                                   'rbf']},
             scoring='f1', verbose=1)
In [317]:
##################################
# Identifying the best model
##################################
stacked_baselearner_svm_optimal = stacked_baselearner_svm_grid_search.best_estimator_
In [318]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_svm_optimal_f1_cv = stacked_baselearner_svm_grid_search.best_score_
stacked_baselearner_svm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train))
stacked_baselearner_svm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation))
In [319]:
##################################
# Displaying the optimal model hyperparameters
##################################
print('Best Stacked Base Learner SVM: ')
print(f"Best Stacked Base Learner SVM Hyperparameters: {stacked_baselearner_svm_grid_search.best_params_}")
Best Stacked Base Learner SVM: 
Best Stacked Base Learner SVM Hyperparameters: {'stacked_baselearner_svm_model__C': 1.0, 'stacked_baselearner_svm_model__gamma': 'scale', 'stacked_baselearner_svm_model__kernel': 'linear'}
In [320]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_svm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_svm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8219
F1 Score on Training Data: 0.8438

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.91      0.93       143
         1.0       0.81      0.89      0.84        61

    accuracy                           0.90       204
   macro avg       0.88      0.90      0.89       204
weighted avg       0.91      0.90      0.90       204
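
As a quick sanity check, the class-1 F1 score follows directly from the reported precision and recall: F1 = 2PR / (P + R) = 2(0.8060)(0.8852) / (0.8060 + 0.8852) ≈ 0.8438, matching the training score above.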

In [321]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner SVM Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner SVM Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal stacked base learner SVM on the train data]
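
For reference, normalize='true' divides each row of the raw confusion matrix by its actual-class total, so that each row sums to one. A minimal equivalent computation, as a sketch reusing cm_raw from the cell above:

import numpy as np

# Row-normalize the raw counts so each actual class sums to 1,
# mirroring confusion_matrix(..., normalize='true')
cm_normalized_manual = cm_raw / cm_raw.sum(axis=1, keepdims=True)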
In [322]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_svm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69

In [323]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner SVM Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner SVM Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal stacked base learner SVM on the validation data]
In [324]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_baselearner_svm_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train))
stacked_baselearner_svm_optimal_train['model'] = ['stacked_baselearner_svm_optimal'] * 5
stacked_baselearner_svm_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Base Learner SVM Train Performance Metrics: ')
display(stacked_baselearner_svm_optimal_train)
Optimal Stacked Base Learner SVM Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.901961 stacked_baselearner_svm_optimal train
1 Precision 0.805970 stacked_baselearner_svm_optimal train
2 Recall 0.885246 stacked_baselearner_svm_optimal train
3 F1 0.843750 stacked_baselearner_svm_optimal train
4 AUROC 0.897168 stacked_baselearner_svm_optimal train
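
The model_performance_evaluation helper is defined earlier in the notebook; based on the long-format table above, a minimal sketch of such a helper (column names inferred from the display, not the original definition) might look like:

import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def model_performance_evaluation(y_true, y_pred):
    # Assemble the five classification metrics in long format
    metrics = {'Accuracy': accuracy_score(y_true, y_pred),
               'Precision': precision_score(y_true, y_pred),
               'Recall': recall_score(y_true, y_pred),
               'F1': f1_score(y_true, y_pred),
               'AUROC': roc_auc_score(y_true, y_pred)}
    return pd.DataFrame({'metric_name': list(metrics.keys()),
                         'metric_value': list(metrics.values())})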
In [325]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_baselearner_svm_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation))
stacked_baselearner_svm_optimal_validation['model'] = ['stacked_baselearner_svm_optimal'] * 5
stacked_baselearner_svm_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Base Learner SVM Validation Performance Metrics: ')
display(stacked_baselearner_svm_optimal_validation)
Optimal Stacked Base Learner SVM Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.913043 stacked_baselearner_svm_optimal validation
1 Precision 0.818182 stacked_baselearner_svm_optimal validation
2 Recall 0.900000 stacked_baselearner_svm_optimal validation
3 F1 0.857143 stacked_baselearner_svm_optimal validation
4 AUROC 0.909184 stacked_baselearner_svm_optimal validation
In [326]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(stacked_baselearner_svm_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_baselearner_svm_optimal.pkl"))
Out[326]:
['..\\models\\stacked_model_baselearner_svm_optimal.pkl']
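
The persisted pipeline can later be restored without retraining; a minimal usage sketch with the same path:

stacked_baselearner_svm_reloaded = joblib.load(
    os.path.join("..", MODELS_PATH, "stacked_model_baselearner_svm_optimal.pkl"))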

1.9.3 Base Learner - Ridge Classifier ¶

Ridge Classifier is a linear classification method that casts the problem as regularized least-squares regression on the encoded class labels, incorporating an L2 penalty that prevents overfitting by shrinking large coefficients in the decision boundary equation. It assumes a linear relationship between the predictor variables and the target class and assigns labels based on the sign of the linear decision function rather than an explicit probability estimate. The key steps include fitting a linear model while adding a penalty term to shrink coefficient values, which reduces variance and improves generalization. Ridge Classifier is particularly useful when dealing with collinear features, as it distributes the importance among correlated variables instead of assigning extreme weights to a few. The advantages of Ridge Classifier include its efficiency, interpretability, and ability to handle high-dimensional data with multicollinearity. However, it has limitations: it assumes a linear decision boundary, making it unsuitable for complex, non-linear relationships, and the regularization parameter requires tuning to balance bias and variance effectively. Additionally, it does not perform feature selection, meaning all input features contribute to the decision-making process, which may reduce interpretability in some cases.
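
To make the contrast with logistic regression concrete, the following is a minimal sketch on synthetic data (not the project pipeline) showing that RidgeClassifier scores observations with a linear decision function and assigns the class by its sign, with no probability estimate involved:

import numpy as np
from sklearn.linear_model import RidgeClassifier

# Synthetic two-class data (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Targets are internally mapped to {-1, +1} and fit by penalized least squares
clf = RidgeClassifier(alpha=1.0).fit(X, y)

# Predictions are simply the sign of the linear score
scores = clf.decision_function(X)
assert ((scores > 0).astype(int) == clf.predict(X)).all()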

  1. The ridge classifier model from the sklearn.linear_model Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • alpha = regularization strength made to vary between 1.0 and 2.0
    • solver = solver to use in the computational routines made to vary between sag and saga
    • tol = precision of the solution made to vary between 1e-3 and 1e-4
  3. A special hyperparameter (class_weight = balanced) was fixed to address the mild 2:1 class imbalance observed between the No and Yes Recurred categories (see the weighting sketch after this list).
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method, with the best F1-score performance obtained for:
    • alpha = 2.0
    • solver = saga
    • tol = 1e-4
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8872
    • Precision = 0.7638
    • Recall = 0.9016
    • F1 Score = 0.8270
    • AUROC = 0.8913
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8985
    • Precision = 0.7826
    • Recall = 0.9000
    • F1 Score = 0.8372
    • AUROC = 0.8989
  7. The apparent and independent validation model performance measures were sufficiently comparable, suggesting the absence of excessive model overfitting.
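
As referenced in item 3 above, class_weight = balanced reweights observations inversely to class frequency using scikit-learn's rule n_samples / (n_classes * class_count). A minimal sketch using the train-split counts reported in the classification reports (143 No, 61 Yes):

import numpy as np

# Train split class counts: 143 'No' and 61 'Yes' recurrence cases
class_counts = np.array([143, 61])
n_samples, n_classes = class_counts.sum(), len(class_counts)

# 'balanced' weight per class: n_samples / (n_classes * count)
balanced_weights = n_samples / (n_classes * class_counts)
print(balanced_weights)  # approximately [0.71, 1.67]; the minority class is upweighted
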
In [327]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [328]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_rc_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('stacked_baselearner_rc_model', RidgeClassifier(class_weight='balanced',
                                                     random_state=987654321))
])
In [329]:
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_rc_hyperparameter_grid = {
    'stacked_baselearner_rc_model__alpha': [1.00, 2.00],
    'stacked_baselearner_rc_model__solver': ['sag', 'saga'],
    'stacked_baselearner_rc_model__tol': [1e-3, 1e-4]
}
In [330]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [331]:
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_rc_grid_search = GridSearchCV(
    estimator=stacked_baselearner_rc_pipeline,
    param_grid=stacked_baselearner_rc_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [332]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [333]:
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_rc_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[333]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_rc_model',
                                        RidgeClassifier(class_weight='balanced',
                                                        random_state=987654321))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_rc_model__alpha': [1.0, 2.0],
                         'stacked_baselearner_rc_model__solver': ['sag',
                                                                  'saga'],
                         'stacked_baselearner_rc_model__tol': [0.001, 0.0001]},
             scoring='f1', verbose=1)
In [334]:
##################################
# Identifying the best model
##################################
stacked_baselearner_rc_optimal = stacked_baselearner_rc_grid_search.best_estimator_
In [335]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_rc_optimal_f1_cv = stacked_baselearner_rc_grid_search.best_score_
stacked_baselearner_rc_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train))
stacked_baselearner_rc_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation))
In [336]:
##################################
# Displaying the optimal model hyperparameters
##################################
print('Best Stacked Base Learner Ridge Classifier: ')
print(f"Best Stacked Base Learner Ridge Classifier Hyperparameters: {stacked_baselearner_rc_grid_search.best_params_}")
Best Stacked Base Learner Ridge Classifier: 
Best Stacked Base Learner Ridge Classifier Hyperparameters: {'stacked_baselearner_rc_model__alpha': 2.0, 'stacked_baselearner_rc_model__solver': 'saga', 'stacked_baselearner_rc_model__tol': 0.0001}
In [337]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_rc_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_rc_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8097
F1 Score on Training Data: 0.8271

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.88      0.92       143
         1.0       0.76      0.90      0.83        61

    accuracy                           0.89       204
   macro avg       0.86      0.89      0.87       204
weighted avg       0.90      0.89      0.89       204

In [338]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal stacked base learner ridge classifier on the train data]
In [339]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_rc_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8372

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.90      0.93        49
         1.0       0.78      0.90      0.84        20

    accuracy                           0.90        69
   macro avg       0.87      0.90      0.88        69
weighted avg       0.91      0.90      0.90        69

In [340]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal stacked base learner ridge classifier on the validation data]
In [341]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_baselearner_rc_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train))
stacked_baselearner_rc_optimal_train['model'] = ['stacked_baselearner_rc_optimal'] * 5
stacked_baselearner_rc_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Base Learner Ridge Classifier Train Performance Metrics: ')
display(stacked_baselearner_rc_optimal_train)
Optimal Stacked Base Learner Ridge Classifier Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.887255 stacked_baselearner_rc_optimal train
1 Precision 0.763889 stacked_baselearner_rc_optimal train
2 Recall 0.901639 stacked_baselearner_rc_optimal train
3 F1 0.827068 stacked_baselearner_rc_optimal train
4 AUROC 0.891379 stacked_baselearner_rc_optimal train
In [342]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_baselearner_rc_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation))
stacked_baselearner_rc_optimal_validation['model'] = ['stacked_baselearner_rc_optimal'] * 5
stacked_baselearner_rc_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Base Learner Ridge Classifier Validation Performance Metrics: ')
display(stacked_baselearner_rc_optimal_validation)
Optimal Stacked Base Learner Ridge Classifier Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.898551 stacked_baselearner_rc_optimal validation
1 Precision 0.782609 stacked_baselearner_rc_optimal validation
2 Recall 0.900000 stacked_baselearner_rc_optimal validation
3 F1 0.837209 stacked_baselearner_rc_optimal validation
4 AUROC 0.898980 stacked_baselearner_rc_optimal validation
In [343]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(stacked_baselearner_rc_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_baselearner_ridge_classifier_optimal.pkl"))
Out[343]:
['..\\models\\stacked_model_baselearner_ridge_classifier_optimal.pkl']

1.9.4 Base Learner - Neural Network ¶

Neural Network is a classification algorithm inspired by the human brain, consisting of layers of interconnected neurons that transform input features through weighted connections and activation functions. It learns patterns in data through backpropagation, where the network adjusts its internal weights to minimize classification error. The process involves an input layer receiving data, multiple hidden layers extracting hierarchical features, and an output layer producing a final prediction. The key advantages of neural networks include their ability to model highly complex, non-linear relationships, making them suitable for image, text, and speech classification tasks. They are also highly scalable, capable of handling massive datasets. However, neural networks have several challenges: they require substantial computational resources, especially for deep architectures; they need large amounts of labeled data for effective training; and they are often difficult to interpret due to their "black box" nature. Additionally, hyperparameter tuning, including choosing the number of layers, neurons, and activation functions, is non-trivial and requires careful optimization to prevent overfitting or underfitting.
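
To ground the mechanics, the following is a minimal sketch on synthetic data (not the project pipeline) of a single-hidden-layer MLPClassifier mirroring the tuned configuration below (50 relu units, lbfgs solver):

import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic two-class data with a non-linear decision boundary (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Input layer -> one hidden layer of 50 relu units -> output layer
mlp = MLPClassifier(hidden_layer_sizes=(50,), activation='relu', alpha=0.0001,
                    solver='lbfgs', max_iter=500, random_state=0).fit(X, y)
print(mlp.score(X, y))  # training accuracy on the synthetic task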

  1. The neural network model from the sklearn.neural_network Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • hidden_layer_sizes = number of neurons per hidden layer (the ith element sets the width of the ith layer), made to vary between (50,) and (100,)
    • activation = activation function for the hidden layer made to vary between relu and tanh
    • alpha = strength of the L2 regularization term made to vary between 0.0001 and 0.001
  3. No hyperparameter was defined in the model to address the mild 2:1 class imbalance observed between the No and Yes Recurred categories, as MLPClassifier does not expose a class_weight option.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method, with the best F1-score performance obtained for:
    • hidden_layer_sizes = (50,)
    • activation = relu
    • alpha = 0.0001
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8921
    • Precision = 0.8095
    • Recall = 0.8360
    • F1 Score = 0.8225
    • AUROC = 0.8760
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8840
    • Precision = 0.7727
    • Recall = 0.8500
    • F1 Score = 0.8095
    • AUROC = 0.8739
  7. The apparent and independent validation model performance measures were sufficiently comparable, suggesting the absence of excessive model overfitting.
In [344]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [345]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_nn_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('stacked_baselearner_nn_model', MLPClassifier(max_iter=500,
                                                   solver='lbfgs',
                                                   early_stopping=False,
                                                   random_state=987654321))
])
In [346]:
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_nn_hyperparameter_grid = {
    'stacked_baselearner_nn_model__hidden_layer_sizes': [(50,), (100,)],
    'stacked_baselearner_nn_model__activation': ['relu', 'tanh'],
    'stacked_baselearner_nn_model__alpha': [0.0001, 0.001]
}
In [347]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [348]:
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_nn_grid_search = GridSearchCV(
    estimator=stacked_baselearner_nn_pipeline,
    param_grid=stacked_baselearner_nn_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [349]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [350]:
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_nn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[350]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_nn_model',
                                        MLPClassifier(max_iter=500,
                                                      random_state=987654321,
                                                      solver='lbfgs'))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_nn_model__activation': ['relu',
                                                                      'tanh'],
                         'stacked_baselearner_nn_model__alpha': [0.0001, 0.001],
                         'stacked_baselearner_nn_model__hidden_layer_sizes': [(50,),
                                                                              (100,)]},
             scoring='f1', verbose=1)
In [351]:
##################################
# Identifying the best model
##################################
stacked_baselearner_nn_optimal = stacked_baselearner_nn_grid_search.best_estimator_
In [352]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_nn_optimal_f1_cv = stacked_baselearner_nn_grid_search.best_score_
stacked_baselearner_nn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train))
stacked_baselearner_nn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation))
In [353]:
##################################
# Displaying the optimal model hyperparameters
##################################
print('Best Stacked Base Learner Neural Network: ')
print(f"Best Stacked Base Learner Neural Network Hyperparameters: {stacked_baselearner_nn_grid_search.best_params_}")
Best Stacked Base Learner Neural Network: 
Best Stacked Base Learner Neural Network Hyperparameters: {'stacked_baselearner_nn_model__activation': 'relu', 'stacked_baselearner_nn_model__alpha': 0.0001, 'stacked_baselearner_nn_model__hidden_layer_sizes': (50,)}
In [354]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_nn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_nn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8063
F1 Score on Training Data: 0.8226

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.93      0.92      0.92       143
         1.0       0.81      0.84      0.82        61

    accuracy                           0.89       204
   macro avg       0.87      0.88      0.87       204
weighted avg       0.89      0.89      0.89       204

In [355]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Neural Network Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Neural Network Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal stacked base learner neural network on the train data]
In [356]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_nn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8095

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.90      0.92        49
         1.0       0.77      0.85      0.81        20

    accuracy                           0.88        69
   macro avg       0.85      0.87      0.86        69
weighted avg       0.89      0.88      0.89        69

In [357]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Neural Network Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Neural Network Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal stacked base learner neural network on the validation data]
In [358]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_baselearner_nn_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train))
stacked_baselearner_nn_optimal_train['model'] = ['stacked_baselearner_nn_optimal'] * 5
stacked_baselearner_nn_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Base Learner Neural Network Train Performance Metrics: ')
display(stacked_baselearner_nn_optimal_train)
Optimal Stacked Base Learner Neural Network Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.892157 stacked_baselearner_nn_optimal train
1 Precision 0.809524 stacked_baselearner_nn_optimal train
2 Recall 0.836066 stacked_baselearner_nn_optimal train
3 F1 0.822581 stacked_baselearner_nn_optimal train
4 AUROC 0.876075 stacked_baselearner_nn_optimal train
In [359]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_baselearner_nn_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation))
stacked_baselearner_nn_optimal_validation['model'] = ['stacked_baselearner_nn_optimal'] * 5
stacked_baselearner_nn_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Base Learner Neural Network Validation Performance Metrics: ')
display(stacked_baselearner_nn_optimal_validation)
Optimal Stacked Base Learner Neural Network Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.884058 stacked_baselearner_nn_optimal validation
1 Precision 0.772727 stacked_baselearner_nn_optimal validation
2 Recall 0.850000 stacked_baselearner_nn_optimal validation
3 F1 0.809524 stacked_baselearner_nn_optimal validation
4 AUROC 0.873980 stacked_baselearner_nn_optimal validation
In [360]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(stacked_baselearner_nn_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_baselearner_neural_network_optimal.pkl"))
Out[360]:
['..\\models\\stacked_model_baselearner_neural_network_optimal.pkl']

1.9.5 Base Learner - Decision Tree ¶

Decision Tree is a hierarchical classification model that recursively splits data based on feature values, forming a tree-like structure where each node represents a decision rule and each leaf represents a class label. The tree is built using a greedy algorithm that selects the best feature at each step based on criteria such as information gain or Gini impurity. The main advantages of decision trees include their interpretability, as the decision-making process can be easily visualized and understood, and their ability to model non-linear relationships without requiring extensive feature engineering. They also handle both numerical and categorical data well. However, decision trees are prone to overfitting, especially when deep trees are grown without pruning. Small changes in the dataset can lead to entirely different structures, making them unstable. Additionally, they tend to perform poorly on highly complex problems where relationships between variables are intricate, making ensemble methods such as Random Forest or Gradient Boosting more effective in practice.
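
To illustrate the split criterion, a minimal sketch computing Gini impurity and the weighted impurity decrease for a candidate split (all counts are illustrative, not taken from the dataset):

import numpy as np

def gini(counts):
    # Gini impurity: 1 - sum of squared class proportions
    p = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

# Hypothetical parent node (100 'No', 50 'Yes') split into two children
parent, left, right = [100, 50], [90, 10], [10, 40]
n = sum(parent)
weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
print(gini(parent) - weighted_children)  # impurity decrease; larger means a better split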

  1. The decision tree model from the sklearn.tree Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • criterion = function to measure the quality of a split made to vary between gini and entropy
    • max_depth = maximum depth of the tree made to vary between 3 and 6
    • min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
  3. A special hyperparameter (class_weight = balanced) was fixed to address the mild 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method, with the best F1-score performance obtained for:
    • criterion = gini
    • max_depth = 6
    • min_samples_leaf = 5
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8970
    • Precision = 0.7500
    • Recall = 0.9836
    • F1 Score = 0.8510
    • AUROC = 0.9218
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8550
    • Precision = 0.6666
    • Recall = 1.0000
    • F1 Score = 0.8000
    • AUROC = 0.8979
  7. The apparent and independent validation model performance measures were sufficiently comparable, suggesting the absence of excessive model overfitting.
In [361]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [362]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_dt_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('stacked_baselearner_dt_model', DecisionTreeClassifier(class_weight='balanced',
                                                            random_state=987654321))
])
In [363]:
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_dt_hyperparameter_grid = {
    'stacked_baselearner_dt_model__criterion': ['gini', 'entropy'],
    'stacked_baselearner_dt_model__max_depth': [3, 6],
    'stacked_baselearner_dt_model__min_samples_leaf': [5, 10]
}
In [364]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [365]:
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_dt_grid_search = GridSearchCV(
    estimator=stacked_baselearner_dt_pipeline,
    param_grid=stacked_baselearner_dt_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [366]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [367]:
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_dt_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[367]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_dt_model',
                                        DecisionTreeClassifier(class_weight='balanced',
                                                               random_state=987654321))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_dt_model__criterion': ['gini',
                                                                     'entropy'],
                         'stacked_baselearner_dt_model__max_depth': [3, 6],
                         'stacked_baselearner_dt_model__min_samples_leaf': [5,
                                                                            10]},
             scoring='f1', verbose=1)
In [368]:
##################################
# Identifying the best model
##################################
stacked_baselearner_dt_optimal = stacked_baselearner_dt_grid_search.best_estimator_
In [369]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_dt_optimal_f1_cv = stacked_baselearner_dt_grid_search.best_score_
stacked_baselearner_dt_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train))
stacked_baselearner_dt_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation))
In [370]:
##################################
# Displaying the optimal model hyperparameters
##################################
print('Best Stacked Base Learner Decision Tree: ')
print(f"Best Stacked Base Learner Decision Tree Hyperparameters: {stacked_baselearner_dt_grid_search.best_params_}")
Best Stacked Base Learner Decision Tree: 
Best Stacked Base Learner Decision Tree Hyperparameters: {'stacked_baselearner_dt_model__criterion': 'gini', 'stacked_baselearner_dt_model__max_depth': 6, 'stacked_baselearner_dt_model__min_samples_leaf': 5}
In [371]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_dt_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_dt_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8099
F1 Score on Training Data: 0.8511

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.99      0.86      0.92       143
         1.0       0.75      0.98      0.85        61

    accuracy                           0.90       204
   macro avg       0.87      0.92      0.89       204
weighted avg       0.92      0.90      0.90       204

In [372]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Decision Tree Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Decision Tree Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and normalized confusion matrices for the optimal stacked base learner decision tree on the train data (Predicted vs Actual)]
In [373]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_dt_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8000

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       1.00      0.80      0.89        49
         1.0       0.67      1.00      0.80        20

    accuracy                           0.86        69
   macro avg       0.83      0.90      0.84        69
weighted avg       0.90      0.86      0.86        69

In [374]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Decision Tree Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Decision Tree Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and normalized confusion matrices for the optimal stacked base learner decision tree on the validation data (Predicted vs Actual)]
In [375]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_baselearner_dt_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train))
stacked_baselearner_dt_optimal_train['model'] = ['stacked_baselearner_dt_optimal'] * 5
stacked_baselearner_dt_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Base Learner Decision Tree Train Performance Metrics: ')
display(stacked_baselearner_dt_optimal_train)
Optimal Stacked Base Learner Decision Tree Train Performance Metrics: 
   metric_name  metric_value                           model    set
0     Accuracy      0.897059  stacked_baselearner_dt_optimal  train
1    Precision      0.750000  stacked_baselearner_dt_optimal  train
2       Recall      0.983607  stacked_baselearner_dt_optimal  train
3           F1      0.851064  stacked_baselearner_dt_optimal  train
4        AUROC      0.921873  stacked_baselearner_dt_optimal  train
In [376]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_baselearner_dt_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation))
stacked_baselearner_dt_optimal_validation['model'] = ['stacked_baselearner_dt_optimal'] * 5
stacked_baselearner_dt_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Base Learner Decision Tree Validation Performance Metrics: ')
display(stacked_baselearner_dt_optimal_validation)
Optimal Stacked Base Learner Decision Tree Validation Performance Metrics: 
   metric_name  metric_value                           model         set
0     Accuracy      0.855072  stacked_baselearner_dt_optimal  validation
1    Precision      0.666667  stacked_baselearner_dt_optimal  validation
2       Recall      1.000000  stacked_baselearner_dt_optimal  validation
3           F1      0.800000  stacked_baselearner_dt_optimal  validation
4        AUROC      0.897959  stacked_baselearner_dt_optimal  validation
In [377]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(stacked_baselearner_dt_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_baselearner_decision_trees_optimal.pkl"))
Out[377]:
['..\\models\\stacked_model_baselearner_decision_trees_optimal.pkl']

1.9.6 Meta Learner - Logistic Regression ¶

Logistic Regression is a linear classification algorithm that estimates the probability of a binary outcome using the logistic (sigmoid) function. It assumes a linear relationship between the predictor variables and the log-odds of the target class. The algorithm involves calculating a weighted sum of input features, applying the sigmoid function to transform the result into a probability, and assigning a class label based on a threshold (typically 0.5). Logistic regression is simple, interpretable, and computationally efficient, making it a popular choice for baseline models and problems where relationships between features and the target variable are approximately linear. It also provides insight into feature importance through its learned coefficients. However, logistic regression has limitations: it struggles with non-linear relationships unless feature engineering or polynomial terms are used, and it is sensitive to multicollinearity among the predictor variables, which may not always be avoidable in real-world data. Additionally, it may perform poorly when classes are highly imbalanced, requiring techniques such as weighting or resampling to improve predictions.
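As a concrete trace of these steps, the minimal sketch below (using hypothetical coefficient and feature values, not quantities estimated in this notebook) converts a weighted feature sum into a log-odds value, maps it to a probability through the sigmoid function, and assigns a class label at the 0.5 threshold.

##################################
# Minimal sketch of a single logistic regression prediction
# (hypothetical coefficients and feature values)
##################################
import numpy as np

def sigmoid(z):
    # Map the log-odds to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned coefficients and intercept for two features
coefficients = np.array([1.2, -0.8])
intercept = -0.3

# Hypothetical feature values for a single observation
x = np.array([0.5, 1.0])

# Weighted sum of inputs = log-odds of the positive class
log_odds = np.dot(coefficients, x) + intercept

# Transform into a probability and assign a label at the 0.5 threshold
probability = sigmoid(log_odds)
predicted_class = int(probability >= 0.5)
print(f"Probability: {probability:.4f}, Predicted class: {predicted_class}")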

  1. The logistic regression model from the sklearn.linear_model Python library API was implemented.
  2. The model contains 3 fixed hyperparameters:
    • C = inverse of regularization strength held constant at a value of 1.0
    • penalty = penalty norm held constant at a value of l2
    • solver = algorithm used in the optimization problem held constant at a value of lbfgs
  3. A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9068
    • Precision = 0.8088
    • Recall = 0.9016
    • F1 Score = 0.8527
    • AUROC = 0.9053
  5. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9130
    • Precision = 0.8181
    • Recall = 0.9000
    • F1 Score = 0.8571
    • AUROC = 0.9091
  6. Sufficiently comparable apparent and independent validation model performance observed that might be indicative of the absence of excessive model overfitting.
In [378]:
##################################
# Defining the stacking strategy (5-fold CV)
##################################
stacking_strategy = KFold(n_splits=5,
                          shuffle=True,
                          random_state=987654321)
In [379]:
##################################
# Loading the pre-trained base learners
# from the previously saved pickle files
##################################
stacked_baselearners = {}
stacked_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
for name in stacked_baselearner_model:
    stacked_baselearner_model_path = (os.path.join("..", MODELS_PATH, f"stacked_model_baselearner_{name}_optimal.pkl"))
    stacked_baselearners[name] = joblib.load(stacked_baselearner_model_path)
        
In [380]:
##################################
# Initializing the meta-feature matrices
##################################
meta_train_stacked = np.zeros((X_preprocessed_train.shape[0], len(stacked_baselearners)))
meta_validation_stacked = np.zeros((X_preprocessed_validation.shape[0], len(stacked_baselearners)))
In [381]:
##################################
# Generating out-of-fold predictions for training the meta learner
##################################
for i, (name, model) in enumerate(stacked_baselearners.items()):
    oof_preds = np.zeros(X_preprocessed_train.shape[0])
    validation_fold_preds = np.zeros((X_preprocessed_validation.shape[0], stacking_strategy.get_n_splits()))

    for j, (train_idx, val_idx) in enumerate(stacking_strategy.split(X_preprocessed_train)):
        model.fit(X_preprocessed_train.iloc[train_idx], y_preprocessed_train_encoded[train_idx])
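        # Use class-1 probabilities as meta-features when predict_proba is available;
        # fall back to hard label predictions otherwise (e.g., the ridge classifier
        # base learner exposes no predict_proba method)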
        oof_preds[val_idx] = model.predict_proba(X_preprocessed_train.iloc[val_idx])[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_train.iloc[val_idx])
        validation_fold_preds[:, j] = model.predict_proba(X_preprocessed_validation)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_validation)
        
    # Extracting the meta-feature matrix for the train data
    meta_train_stacked[:, i] = oof_preds
    # Extracting the meta-feature matrix for the validation data
    # Averaging the validation predictions across folds
    meta_validation_stacked[:, i] = validation_fold_preds.mean(axis=1)  
In [382]:
##################################
# Training the meta learner on the stacked features
##################################
stacked_metalearner_lr_optimal = LogisticRegression(class_weight='balanced',
                                                    penalty='l2',
                                                    C=1.0,
                                                    solver='lbfgs',
                                                    random_state=987654321)
stacked_metalearner_lr_optimal.fit(meta_train_stacked, y_preprocessed_train_encoded)
Out[382]:
LogisticRegression(class_weight='balanced', random_state=987654321)
In [383]:
##################################
# Saving the meta learner model
# developed from the meta-train data
################################## 
joblib.dump(stacked_metalearner_lr_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_metalearner_logistic_regression_optimal.pkl"))
Out[383]:
['..\\models\\stacked_model_metalearner_logistic_regression_optimal.pkl']
In [384]:
##################################
# Creating a function to extract the 
# meta-feature matrices for new data
################################## 
def extract_stacked_metafeature_matrix(X_preprocessed_new):
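    # Note: this helper relies on X_preprocessed_train, y_preprocessed_train_encoded,
    # stacking_strategy, and MODELS_PATH from the enclosing notebook scope,
    # and refits every base learner across all folds on each call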
    ##################################
    # Loading the pre-trained base learners
    # from the previously saved pickle files
    ##################################
    stacked_baselearners = {}
    stacked_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
    for name in stacked_baselearner_model:
        stacked_baselearner_model_path = (os.path.join("..", MODELS_PATH, f"stacked_model_baselearner_{name}_optimal.pkl"))
        stacked_baselearners[name] = joblib.load(stacked_baselearner_model_path)

    ##################################
    # Generating meta-features for new data
    ##################################
    meta_train_stacked = np.zeros((X_preprocessed_train.shape[0], len(stacked_baselearners)))
    meta_new_stacked = np.zeros((X_preprocessed_new.shape[0], len(stacked_baselearners)))

    ##################################
    # Generating out-of-fold predictions for training the meta learner
    ##################################
    for i, (name, model) in enumerate(stacked_baselearners.items()):
        oof_preds = np.zeros(X_preprocessed_train.shape[0])
        new_fold_preds = np.zeros((X_preprocessed_new.shape[0], stacking_strategy.get_n_splits()))

        for j, (train_idx, val_idx) in enumerate(stacking_strategy.split(X_preprocessed_train)):
            model.fit(X_preprocessed_train.iloc[train_idx], y_preprocessed_train_encoded[train_idx])
            oof_preds[val_idx] = model.predict_proba(X_preprocessed_train.iloc[val_idx])[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_train.iloc[val_idx])
            new_fold_preds[:, j] = model.predict_proba(X_preprocessed_new)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_new)
        
        # Extracting the meta-feature matrix for the train data
        meta_train_stacked[:, i] = oof_preds
        # Extracting the meta-feature matrix for the new data
        # Averaging the new predictions across folds
        meta_new_stacked[:, i] = new_fold_preds.mean(axis=1)

    return meta_new_stacked
    
In [385]:
##################################
# Evaluating the F1 scores
# on the training and validation data
##################################
stacked_metalearner_lr_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train)))
stacked_metalearner_lr_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation)))
In [386]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training data
# to assess overfitting optimism
##################################
print(f"F1 Score on Training Data: {stacked_metalearner_lr_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train))))
F1 Score on Training Data: 0.8527

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.91      0.93       143
         1.0       0.81      0.90      0.85        61

    accuracy                           0.91       204
   macro avg       0.88      0.91      0.89       204
weighted avg       0.91      0.91      0.91       204

In [387]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train)))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and normalized confusion matrices for the optimal stacked meta learner logistic regression on the train data (Predicted vs Actual)]
In [388]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validationing Data: {stacked_metalearner_lr_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation))))
F1 Score on Validationing Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69

In [389]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation)))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and normalized confusion matrices for the optimal stacked meta learner logistic regression on the validation data (Predicted vs Actual)]
In [390]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_metalearner_lr_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train)))
stacked_metalearner_lr_optimal_train['model'] = ['stacked_metalearner_lr_optimal'] * 5
stacked_metalearner_lr_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Meta Learner Logistic Regression Train Performance Metrics: ')
display(stacked_metalearner_lr_optimal_train)
Optimal Stacked Meta Learner Logistic Regression Train Performance Metrics: 
   metric_name  metric_value                           model    set
0     Accuracy      0.906863  stacked_metalearner_lr_optimal  train
1    Precision      0.808824  stacked_metalearner_lr_optimal  train
2       Recall      0.901639  stacked_metalearner_lr_optimal  train
3           F1      0.852713  stacked_metalearner_lr_optimal  train
4        AUROC      0.905365  stacked_metalearner_lr_optimal  train
In [391]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_metalearner_lr_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation)))
stacked_metalearner_lr_optimal_validation['model'] = ['stacked_metalearner_lr_optimal'] * 5
stacked_metalearner_lr_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Meta Learner Logistic Regression Validation Performance Metrics: ')
display(stacked_metalearner_lr_optimal_validation)
Optimal Stacked Meta Learner Logistic Regression Validation Performance Metrics: 
   metric_name  metric_value                           model         set
0     Accuracy      0.913043  stacked_metalearner_lr_optimal  validation
1    Precision      0.818182  stacked_metalearner_lr_optimal  validation
2       Recall      0.900000  stacked_metalearner_lr_optimal  validation
3           F1      0.857143  stacked_metalearner_lr_optimal  validation
4        AUROC      0.909184  stacked_metalearner_lr_optimal  validation

1.10 Blended Model Development ¶

Blending is an ensemble technique that enhances classification accuracy by training a meta-model on a holdout validation set, rather than using out-of-fold predictions like stacking. This simplifies implementation while maintaining the benefits of combining multiple base models. The process of blending starts by training base models on the full training dataset. Instead of applying cross-validation to obtain out-of-fold predictions, blending reserves a small portion of the training data as a holdout set. The base models make predictions on this unseen holdout set, and these predictions are then used as input features for a meta-model, which learns how to optimally combine them into a final classification decision. Since the meta-model is trained on predictions from unseen data, it avoids the risk of overfitting that can sometimes occur when base models are evaluated on the same data they were trained on. Blending is motivated by its simplicity and ease of implementation compared to stacking, as it eliminates the need for repeated k-fold cross-validation to generate training data for the meta-model. However, one drawback is that the meta-model has access to fewer training examples, as a portion of the data is withheld for validation rather than being used for training. This can limit the generalization ability of the final model, especially if the holdout set is too small. Despite this limitation, blending remains a useful approach in applications where a quick and effective ensemble method is needed without the computational overhead of stacking.
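As an illustration of these mechanics, the minimal sketch below (on synthetic data with hypothetical names, not the fitted pipelines developed in this notebook) walks through the blending recipe: base learners are fit on the training partition, their predictions on a reserved holdout partition become meta-features, and a logistic regression meta-model is trained on those holdout predictions.

##################################
# Minimal blending sketch on synthetic data
# (hypothetical names, not the fitted pipelines in this notebook)
##################################
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=987654321)

# Reserve a holdout partition whose predictions will train the meta-model
X_train_demo, X_holdout_demo, y_train_demo, y_holdout_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=987654321)

# Fit the base learners on the training partition only
base_models = [KNeighborsClassifier(n_neighbors=3),
               DecisionTreeClassifier(max_depth=6, random_state=987654321)]
for base_model in base_models:
    base_model.fit(X_train_demo, y_train_demo)

# Base-learner probabilities on the unseen holdout become the meta-features
meta_features_demo = np.column_stack(
    [base_model.predict_proba(X_holdout_demo)[:, 1] for base_model in base_models])

# The meta-model learns how to combine the holdout predictions
meta_model_demo = LogisticRegression(random_state=987654321)
meta_model_demo.fit(meta_features_demo, y_holdout_demo)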

1.10.1 Base Learner - K-Nearest Neighbors ¶

K-Nearest Neighbors (KNN) is a non-parametric classification algorithm that makes predictions based on the majority class among the k-nearest training samples in feature space. It does not create an explicit model during training; instead, it stores the entire dataset and computes distances between a query point and all training samples during inference. The algorithm follows three key steps: (1) compute the distance between the query point and all training samples (typically using Euclidean distance), (2) identify the k closest points, and (3) assign the most common class among them as the predicted label. KNN is advantageous because it is simple, requires minimal training time, and can model complex decision boundaries when provided with sufficient data. However, it has significant drawbacks: it is computationally expensive for large datasets since distances must be computed for every prediction, it is sensitive to irrelevant or redundant features, and it requires careful selection of k, as a small k can make the model too sensitive to noise while a large k can overly smooth decision boundaries.
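The three steps can be traced directly in code; the minimal sketch below (on toy values, purely illustrative) computes the Euclidean distances, selects the k closest training samples, and takes a majority vote among their labels.

##################################
# Minimal sketch of the three KNN steps
# (toy data, illustrative only)
##################################
import numpy as np
from collections import Counter

# Toy training samples with binary labels and a single query point
X_train_toy = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 7.0], [7.0, 8.0]])
y_train_toy = np.array([0, 0, 0, 1, 1])
query = np.array([2.5, 3.0])
k = 3

# Step 1: compute the Euclidean distance from the query to every training sample
distances = np.linalg.norm(X_train_toy - query, axis=1)

# Step 2: identify the k closest training samples
nearest_idx = np.argsort(distances)[:k]

# Step 3: assign the most common class among the k neighbors
predicted_class = Counter(y_train_toy[nearest_idx]).most_common(1)[0][0]
print(f"Predicted class: {predicted_class}")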

  1. The k-nearest neighbors model from the sklearn.neighbors Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • n_neighbors = number of neighbors to use made to vary between 3 and 5
    • weights = weight function used in prediction made to vary between uniform and distance
    • metric = metric to use for distance computation made to vary between minkowski and euclidean
  3. No hyperparameter was defined in the model to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories, as the k-nearest neighbors model does not support class weighting.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • n_neighbors = 3
    • weights = uniform
    • metric = minkowski
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9215
    • Precision = 0.9090
    • Recall = 0.8196
    • F1 Score = 0.8620
    • AUROC = 0.8923
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8115
    • Precision = 0.7058
    • Recall = 0.6000
    • F1 Score = 0.6486
    • AUROC = 0.7489
  7. Relatively large difference in apparent and independent validation model performance observed that might be indicative of the presence of moderate model overfitting.
In [392]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
    remainder='passthrough',
    force_int_remainder_cols=False)
In [393]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_knn_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('blended_baselearner_knn_model', KNeighborsClassifier())
])
In [394]:
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_knn_hyperparameter_grid = {
    'blended_baselearner_knn_model__n_neighbors': [3, 5],
    'blended_baselearner_knn_model__weights': ['uniform', 'distance'],
    'blended_baselearner_knn_model__metric': ['minkowski', 'euclidean']
}
In [395]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [396]:
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_knn_grid_search = GridSearchCV(
    estimator=blended_baselearner_knn_pipeline,
    param_grid=blended_baselearner_knn_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [397]:
##################################
# Encoding the response variables
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [398]:
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_knn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[398]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_knn_model',
                                        KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'blended_baselearner_knn_model__metric': ['minkowski',
                                                                   'euclidean'],
                         'blended_baselearner_knn_model__n_neighbors': [3, 5],
                         'blended_baselearner_knn_model__weights': ['uniform',
                                                                    'distance']},
             scoring='f1', verbose=1)
In [399]:
##################################
# Identifying the best model
##################################
blended_baselearner_knn_optimal = blended_baselearner_knn_grid_search.best_estimator_
In [400]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_knn_optimal_f1_cv = blended_baselearner_knn_grid_search.best_score_
blended_baselearner_knn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train))
blended_baselearner_knn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation))
In [401]:
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner KNN: ')
print(f"Best Blended Base Learner KNN Hyperparameters: {blended_baselearner_knn_grid_search.best_params_}")
Best Blended Base Learner KNN: 
Best Blended Base Learner KNN Hyperparameters: {'blended_baselearner_knn_model__metric': 'minkowski', 'blended_baselearner_knn_model__n_neighbors': 3, 'blended_baselearner_knn_model__weights': 'uniform'}
In [402]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_knn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_knn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.6417
F1 Score on Training Data: 0.8621

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.93      0.97      0.95       143
         1.0       0.91      0.82      0.86        61

    accuracy                           0.92       204
   macro avg       0.92      0.89      0.90       204
weighted avg       0.92      0.92      0.92       204

In [403]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner KNN Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner KNN Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and normalized confusion matrices for the optimal blended base learner KNN on the train data (Predicted vs Actual)]
In [404]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_knn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.6486

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.85      0.90      0.87        49
         1.0       0.71      0.60      0.65        20

    accuracy                           0.81        69
   macro avg       0.78      0.75      0.76        69
weighted avg       0.81      0.81      0.81        69

In [405]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner KNN Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner KNN Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and normalized confusion matrices for the optimal blended base learner KNN on the validation data (Predicted vs Actual)]
In [406]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_baselearner_knn_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train))
blended_baselearner_knn_optimal_train['model'] = ['blended_baselearner_knn_optimal'] * 5
blended_baselearner_knn_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Base Learner KNN Train Performance Metrics: ')
display(blended_baselearner_knn_optimal_train)
Optimal Blended Base Learner KNN Train Performance Metrics: 
   metric_name  metric_value                            model    set
0     Accuracy      0.921569  blended_baselearner_knn_optimal  train
1    Precision      0.909091  blended_baselearner_knn_optimal  train
2       Recall      0.819672  blended_baselearner_knn_optimal  train
3           F1      0.862069  blended_baselearner_knn_optimal  train
4        AUROC      0.892354  blended_baselearner_knn_optimal  train
In [407]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_baselearner_knn_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation))
blended_baselearner_knn_optimal_validation['model'] = ['blended_baselearner_knn_optimal'] * 5
blended_baselearner_knn_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Base Learner KNN Validation Performance Metrics: ')
display(blended_baselearner_knn_optimal_validation)
Optimal Blended Base Learner KNN Validation Performance Metrics: 
   metric_name  metric_value                            model         set
0     Accuracy      0.811594  blended_baselearner_knn_optimal  validation
1    Precision      0.705882  blended_baselearner_knn_optimal  validation
2       Recall      0.600000  blended_baselearner_knn_optimal  validation
3           F1      0.648649  blended_baselearner_knn_optimal  validation
4        AUROC      0.748980  blended_baselearner_knn_optimal  validation
In [408]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(blended_baselearner_knn_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_baselearner_knn_optimal.pkl"))
Out[408]:
['..\\models\\blended_model_baselearner_knn_optimal.pkl']

1.10.2 Base Learner - Support Vector Machine ¶

Support Vector Machine (SVM) is a powerful classification algorithm that finds an optimal decision boundary — called a hyperplane — that maximizes the margin between two classes. The algorithm works by identifying the most influential data points, known as support vectors, that define this boundary. If the data is not linearly separable, SVM can use kernel functions to map it into a higher-dimensional space where separation is possible. The main advantages of SVM include strong theoretical guarantees, effectiveness in high-dimensional spaces, and robustness against overfitting when properly regularized. It performs well when the margin between classes is clear and works effectively with small to medium-sized datasets. However, SVM has notable limitations: it is computationally expensive, making it impractical for very large datasets; it requires careful tuning of hyperparameters such as the kernel type and regularization strength; and it is not easily interpretable, as decision boundaries in high-dimensional space can be difficult to visualize.
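As a small illustration (on toy arrays, not the notebook's preprocessed features), the sketch below fits a linear SVC and inspects the support vectors that define the maximum-margin hyperplane.

##################################
# Minimal sketch of a linear SVM and its support vectors
# (toy data, illustrative only)
##################################
import numpy as np
from sklearn.svm import SVC

# Two small linearly separable clusters
X_toy = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
                  [6.0, 6.0], [7.0, 6.5], [6.5, 7.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel searches for the maximum-margin hyperplane;
# C trades off margin width against misclassification of training points
svm_toy = SVC(kernel='linear', C=1.0)
svm_toy.fit(X_toy, y_toy)

# The margin-defining samples are exposed as support vectors
print("Support vectors:\n", svm_toy.support_vectors_)
print("Hyperplane coefficients:", svm_toy.coef_, "intercept:", svm_toy.intercept_)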

  1. The support vector machine model from the sklearn.svm Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • C = inverse of regularization strength made to vary between 0.1 and 1.0
    • kernel = kernel type to be used in the algorithm made to vary between linear and rbf
    • gamma = kernel coefficient made to vary between scale and auto
  3. A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • C = 1.0
    • kernel = linear
    • gamma = scale
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9019
    • Precision = 0.8059
    • Recall = 0.8852
    • F1 Score = 0.8437
    • AUROC = 0.8971
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9130
    • Precision = 0.8181
    • Recall = 0.9000
    • F1 Score = 0.8571
    • AUROC = 0.9091
  7. Sufficiently comparable apparent and independent validation model performance observed that might be indicative of the absence of excessive model overfitting.
In [409]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [410]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_svm_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('blended_baselearner_svm_model', SVC(class_weight='balanced',
                                          random_state=987654321))
])
In [411]:
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_svm_hyperparameter_grid = {
    'blended_baselearner_svm_model__C': [0.1, 1.0],
    'blended_baselearner_svm_model__kernel': ['linear', 'rbf'],
    'blended_baselearner_svm_model__gamma': ['scale','auto']
}
In [412]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [413]:
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_svm_grid_search = GridSearchCV(
    estimator=blended_baselearner_svm_pipeline,
    param_grid=blended_baselearner_svm_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [414]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [415]:
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_svm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[415]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_svm_model',
                                        SVC(class_weight='balanced',
                                            random_state=987654321))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_svm_model__C': [0.1, 1.0],
                         'blended_baselearner_svm_model__gamma': ['scale',
                                                                  'auto'],
                         'blended_baselearner_svm_model__kernel': ['linear',
                                                                   'rbf']},
             scoring='f1', verbose=1)
In [416]:
##################################
# Identifying the best model
##################################
blended_baselearner_svm_optimal = blended_baselearner_svm_grid_search.best_estimator_
In [417]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_svm_optimal_f1_cv = blended_baselearner_svm_grid_search.best_score_
blended_baselearner_svm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train))
blended_baselearner_svm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation))
In [418]:
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner SVM: ')
print(f"Best Blended Base Learner SVM Hyperparameters: {blended_baselearner_svm_grid_search.best_params_}")
Best Blended Base Learner SVM: 
Best Blended Base Learner SVM Hyperparameters: {'blended_baselearner_svm_model__C': 1.0, 'blended_baselearner_svm_model__gamma': 'scale', 'blended_baselearner_svm_model__kernel': 'linear'}
In [419]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_svm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_svm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8219
F1 Score on Training Data: 0.8438

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.91      0.93       143
         1.0       0.81      0.89      0.84        61

    accuracy                           0.90       204
   macro avg       0.88      0.90      0.89       204
weighted avg       0.91      0.90      0.90       204

In [420]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner SVM Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner SVM Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and normalized confusion matrices for the optimal blended base learner SVM on the train data (Predicted vs Actual)]
In [421]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_svm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69

In [422]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner SVM Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner SVM Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and normalized confusion matrices for the optimal blended base learner SVM on the validation data (Predicted vs Actual)]
In [423]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_baselearner_svm_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train))
blended_baselearner_svm_optimal_train['model'] = ['blended_baselearner_svm_optimal'] * 5
blended_baselearner_svm_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Base Learner SVM Train Performance Metrics: ')
display(blended_baselearner_svm_optimal_train)
Optimal Blended Base Learner SVM Train Performance Metrics: 
  metric_name  metric_value                            model    set
0    Accuracy      0.901961  blended_baselearner_svm_optimal  train
1   Precision      0.805970  blended_baselearner_svm_optimal  train
2      Recall      0.885246  blended_baselearner_svm_optimal  train
3          F1      0.843750  blended_baselearner_svm_optimal  train
4       AUROC      0.897168  blended_baselearner_svm_optimal  train
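Note that model_performance_evaluation is fed the hard class predictions from predict() rather than scores, so the ROC curve has a single operating point and the reported AUROC reduces to the average of sensitivity and specificity. A quick check against the tabulated train metrics:

##################################
# Quick check: with hard 0/1 predictions the AUROC reduces
# to (sensitivity + specificity) / 2, matching the table above
##################################
sensitivity = 0.885246    # positive-class recall (54/61)
specificity = 130 / 143   # negative-class recall (~0.909091)
print((sensitivity + specificity) / 2)  # ~0.897168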
In [424]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_baselearner_svm_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation))
blended_baselearner_svm_optimal_validation['model'] = ['blended_baselearner_svm_optimal'] * 5
blended_baselearner_svm_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Base Learner SVM Validation Performance Metrics: ')
display(blended_baselearner_svm_optimal_validation)
Optimal Blended Base Learner SVM Validation Performance Metrics: 
  metric_name  metric_value                            model         set
0    Accuracy      0.913043  blended_baselearner_svm_optimal  validation
1   Precision      0.818182  blended_baselearner_svm_optimal  validation
2      Recall      0.900000  blended_baselearner_svm_optimal  validation
3          F1      0.857143  blended_baselearner_svm_optimal  validation
4       AUROC      0.909184  blended_baselearner_svm_optimal  validation
In [425]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(blended_baselearner_svm_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_baselearner_svm_optimal.pkl"))
Out[425]:
['..\\models\\blended_model_baselearner_svm_optimal.pkl']

1.10.3 Base Learner - Ridge Classifier ¶

Ridge Classifier is a linear classification algorithm that applies L2-regularized least-squares (ridge) regression to the target classes encoded as -1 and +1, assigning each observation to a class according to the sign of the fitted linear score. The L2 penalty prevents overfitting by shrinking large coefficients in the decision boundary equation; unlike logistic regression, the model does not estimate class probabilities through the logistic function. The key steps include fitting a linear model while adding a penalty term to shrink coefficient values, which reduces variance and improves generalization. Ridge Classifier is particularly useful when dealing with collinear features, as it distributes the importance among correlated variables instead of assigning extreme weights to a few. The advantages of Ridge Classifier include its efficiency, interpretability, and ability to handle high-dimensional data with multicollinearity. However, it has limitations: it assumes a linear decision boundary, making it unsuitable for complex, non-linear relationships, and the regularization parameter requires tuning to balance bias and variance effectively. Additionally, it does not perform feature selection, meaning all input features contribute to the decision-making process, which may reduce interpretability in some cases.
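To make the decision rule concrete, the minimal sketch below fits a RidgeClassifier on synthetic data and confirms that its predictions follow the sign of decision_function; the toy dataset and names (X_toy, ridge_toy) are illustrative assumptions, not part of the project pipeline.

##################################
# Minimal sketch (toy data, not the project pipeline):
# RidgeClassifier labels a sample by the sign of its
# linear decision score rather than by a probability
##################################
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)
ridge_toy = RidgeClassifier(alpha=2.0, class_weight='balanced').fit(X_toy, y_toy)
scores = ridge_toy.decision_function(X_toy)   # signed linear scores
manual_labels = (scores > 0).astype(int)      # positive score -> class 1
assert (manual_labels == ridge_toy.predict(X_toy)).all()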

  1. The ridge classifier model from the sklearn.linear_model Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • alpha = regularization strength made to vary between 1.0 and 2.0
    • solver = solver to use in the computational routines made to vary between sag and saga
    • tol = precision of the solution made to vary between 1e-3 and 1e-4
  3. A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • alpha = 2.0
    • solver = saga
    • tol = 1e-4
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8872
    • Precision = 0.7638
    • Recall = 0.9016
    • F1 Score = 0.8270
    • AUROC = 0.8913
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8985
    • Precision = 0.7826
    • Recall = 0.9000
    • F1 Score = 0.8372
    • AUROC = 0.8989
  7. The apparent and independent validation model performance measures were sufficiently comparable, which might be indicative of the absence of excessive model overfitting.
In [426]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [427]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_rc_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('blended_baselearner_rc_model', RidgeClassifier(class_weight='balanced',
                                                     random_state=987654321))
])
In [428]:
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_rc_hyperparameter_grid = {
    'blended_baselearner_rc_model__alpha': [1.00, 2.00],
    'blended_baselearner_rc_model__solver': ['sag', 'saga'],
    'blended_baselearner_rc_model__tol': [1e-3, 1e-4]
}
In [429]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
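This repeated stratified scheme, assigned to cv_strategy above, yields 5 x 5 = 25 resampled splits per hyperparameter candidate; with the 2 x 2 x 2 grid defined earlier, that accounts for the "Fitting 25 folds for each of 8 candidates, totalling 200 fits" message logged below. A quick illustrative check:

##################################
# Quick check (illustrative): 25 splits per candidate
# times 8 grid candidates equals the 200 fits logged below
##################################
print(cv_strategy.get_n_splits())              # 25
print(cv_strategy.get_n_splits() * (2*2*2))    # 200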
In [430]:
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_rc_grid_search = GridSearchCV(
    estimator=blended_baselearner_rc_pipeline,
    param_grid=blended_baselearner_rc_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [431]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
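OrdinalEncoder assigns integer codes in the sorted order of the observed categories, which is why the No and Yes recurrence labels appear as the 0.0 and 1.0 classes in the classification reports. The two-row toy array below (assuming exactly these label values) illustrates the mapping:

##################################
# Minimal check (assuming 'No'/'Yes' response labels):
# OrdinalEncoder codes sorted categories, so 'No' -> 0.0
# and 'Yes' -> 1.0 as shown in the classification reports
##################################
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

toy_encoder = OrdinalEncoder().fit(np.array([['No'], ['Yes']]))
print(toy_encoder.categories_)                                      # [array(['No', 'Yes'], ...)]
print(toy_encoder.transform(np.array([['Yes'], ['No']])).ravel())   # [1. 0.]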
In [432]:
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_rc_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[432]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_rc_model',
                                        RidgeClassifier(class_weight='balanced',
                                                        random_state=987654321))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_rc_model__alpha': [1.0, 2.0],
                         'blended_baselearner_rc_model__solver': ['sag',
                                                                  'saga'],
                         'blended_baselearner_rc_model__tol': [0.001, 0.0001]},
             scoring='f1', verbose=1)
In [433]:
##################################
# Identifying the best model
##################################
blended_baselearner_rc_optimal = blended_baselearner_rc_grid_search.best_estimator_
In [434]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_rc_optimal_f1_cv = blended_baselearner_rc_grid_search.best_score_
blended_baselearner_rc_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train))
blended_baselearner_rc_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation))
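Since F1 is the harmonic mean of precision and recall, the tuning metric can be sanity-checked by hand from the train precision and recall reported for this model:

##################################
# Worked check: F1 as the harmonic mean of the train
# precision and recall reported for this model
##################################
precision, recall = 0.763889, 0.901639
print(2 * precision * recall / (precision + recall))  # ~0.827068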
In [435]:
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner Ridge Classifier: ')
print(f"Best Blended Base Learner Ridge Classifier Hyperparameters: {blended_baselearner_rc_grid_search.best_params_}")
Best Blended Base Learner Ridge Classifier: 
Best Blended Base Learner Ridge Classifier Hyperparameters: {'blended_baselearner_rc_model__alpha': 2.0, 'blended_baselearner_rc_model__solver': 'saga', 'blended_baselearner_rc_model__tol': 0.0001}
In [436]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_rc_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_rc_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8097
F1 Score on Training Data: 0.8271

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.88      0.92       143
         1.0       0.76      0.90      0.83        61

    accuracy                           0.89       204
   macro avg       0.86      0.89      0.87       204
weighted avg       0.90      0.89      0.89       204

In [437]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices - Optimal Blended Base Learner Ridge Classifier Train Performance]
In [438]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_rc_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8372

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.90      0.93        49
         1.0       0.78      0.90      0.84        20

    accuracy                           0.90        69
   macro avg       0.87      0.90      0.88        69
weighted avg       0.91      0.90      0.90        69

In [439]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices - Optimal Blended Base Learner Ridge Classifier Validation Performance]
In [440]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_baselearner_rc_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train))
blended_baselearner_rc_optimal_train['model'] = ['blended_baselearner_rc_optimal'] * 5
blended_baselearner_rc_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Base Learner Ridge Classifier Train Performance Metrics: ')
display(blended_baselearner_rc_optimal_train)
Optimal Blended Base Learner Ridge Classifier Train Performance Metrics: 
  metric_name  metric_value                           model    set
0    Accuracy      0.887255  blended_baselearner_rc_optimal  train
1   Precision      0.763889  blended_baselearner_rc_optimal  train
2      Recall      0.901639  blended_baselearner_rc_optimal  train
3          F1      0.827068  blended_baselearner_rc_optimal  train
4       AUROC      0.891379  blended_baselearner_rc_optimal  train
In [441]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_baselearner_rc_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation))
blended_baselearner_rc_optimal_validation['model'] = ['blended_baselearner_rc_optimal'] * 5
blended_baselearner_rc_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Base Learner Ridge Classifier Validation Performance Metrics: ')
display(blended_baselearner_rc_optimal_validation)
Optimal Blended Base Learner Ridge Classifier Validation Performance Metrics: 
  metric_name  metric_value                           model         set
0    Accuracy      0.898551  blended_baselearner_rc_optimal  validation
1   Precision      0.782609  blended_baselearner_rc_optimal  validation
2      Recall      0.900000  blended_baselearner_rc_optimal  validation
3          F1      0.837209  blended_baselearner_rc_optimal  validation
4       AUROC      0.898980  blended_baselearner_rc_optimal  validation
In [442]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(blended_baselearner_rc_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_baselearner_ridge_classifier_optimal.pkl"))
Out[442]:
['..\\models\\blended_model_baselearner_ridge_classifier_optimal.pkl']

1.10.4 Base Learner - Neural Network ¶

Neural Network is a classification algorithm inspired by the human brain, consisting of layers of interconnected neurons that transform input features through weighted connections and activation functions. It learns patterns in data through backpropagation, where the network adjusts its internal weights to minimize classification error. The process involves an input layer receiving data, multiple hidden layers extracting hierarchical features, and an output layer producing a final prediction. The key advantages of neural networks include their ability to model highly complex, non-linear relationships, making them suitable for image, text, and speech classification tasks. They are also highly scalable, capable of handling massive datasets. However, neural networks have several challenges: they require substantial computational resources, especially for deep architectures; they need large amounts of labeled data for effective training; and they are often difficult to interpret due to their "black box" nature. Additionally, hyperparameter tuning, including choosing the number of layers, neurons, and activation functions, is non-trivial and requires careful optimization to prevent overfitting or underfitting.
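To make the layered transform concrete, the minimal sketch below reproduces a fitted MLPClassifier's binary predictions by hand from its learned weights, mirroring the single (50,)-unit relu architecture tuned in this section; the toy dataset and names (X_toy, mlp_toy) are illustrative, not part of the project pipeline.

##################################
# Minimal sketch (toy data, not the project pipeline): reproducing
# MLPClassifier's binary forward pass from its learned weights with
# a relu hidden layer and a logistic output unit
##################################
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
mlp_toy = MLPClassifier(hidden_layer_sizes=(50,), activation='relu',
                        solver='lbfgs', max_iter=500, random_state=0).fit(X_toy, y_toy)

hidden = np.maximum(0, X_toy @ mlp_toy.coefs_[0] + mlp_toy.intercepts_[0])  # relu hidden layer
logits = hidden @ mlp_toy.coefs_[1] + mlp_toy.intercepts_[1]                # linear output layer
proba_manual = 1 / (1 + np.exp(-logits.ravel()))                            # logistic output unit
assert np.allclose(proba_manual, mlp_toy.predict_proba(X_toy)[:, 1])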

  1. The neural network model from the sklearn.neural_network Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • hidden_layer_sizes = number of neurons in each hidden layer (the ith element gives the size of the ith layer) made to vary between (50,) and (100,)
    • activation = activation function for the hidden layer made to vary between relu and tanh
    • alpha = strength of the L2 regularization term made to vary between 0.0001 and 0.001
  3. No special hyperparameter was available in the model to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories, as MLPClassifier does not support class weighting.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • hidden_layer_sizes = (50,)
    • activation = relu
    • alpha = 0.0001
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8921
    • Precision = 0.8095
    • Recall = 0.8360
    • F1 Score = 0.8225
    • AUROC = 0.8760
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8840
    • Precision = 0.7727
    • Recall = 0.8500
    • F1 Score = 0.8095
    • AUROC = 0.8739
  7. The apparent and independent validation model performance measures were sufficiently comparable, which might be indicative of the absence of excessive model overfitting.
In [443]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [444]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_nn_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('blended_baselearner_nn_model', MLPClassifier(max_iter=500,
                                                   solver='lbfgs',
                                                   early_stopping=False,
                                                   random_state=987654321))
])
In [445]:
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_nn_hyperparameter_grid = {
    'blended_baselearner_nn_model__hidden_layer_sizes': [(50,), (100,)],
    'blended_baselearner_nn_model__activation': ['relu', 'tanh'],
    'blended_baselearner_nn_model__alpha': [0.0001, 0.001]
}
In [446]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [447]:
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_nn_grid_search = GridSearchCV(
    estimator=blended_baselearner_nn_pipeline,
    param_grid=blended_baselearner_nn_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [448]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [449]:
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_nn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[449]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_nn_model',
                                        MLPClassifier(max_iter=500,
                                                      random_state=987654321,
                                                      solver='lbfgs'))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_nn_model__activation': ['relu',
                                                                      'tanh'],
                         'blended_baselearner_nn_model__alpha': [0.0001, 0.001],
                         'blended_baselearner_nn_model__hidden_layer_sizes': [(50,),
                                                                              (100,)]},
             scoring='f1', verbose=1)
In [450]:
##################################
# Identifying the best model
##################################
blended_baselearner_nn_optimal = blended_baselearner_nn_grid_search.best_estimator_
In [451]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_nn_optimal_f1_cv = blended_baselearner_nn_grid_search.best_score_
blended_baselearner_nn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train))
blended_baselearner_nn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation))
In [452]:
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner Neural Network: ')
print(f"Best Blended Base Learner Neural Network Hyperparameters: {blended_baselearner_nn_grid_search.best_params_}")
Best Blended Base Learner Neural Network: 
Best Blended Base Learner Neural Network Hyperparameters: {'blended_baselearner_nn_model__activation': 'relu', 'blended_baselearner_nn_model__alpha': 0.0001, 'blended_baselearner_nn_model__hidden_layer_sizes': (50,)}
In [453]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_nn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_nn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8063
F1 Score on Training Data: 0.8226

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.93      0.92      0.92       143
         1.0       0.81      0.84      0.82        61

    accuracy                           0.89       204
   macro avg       0.87      0.88      0.87       204
weighted avg       0.89      0.89      0.89       204

In [454]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Neural Network Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Neural Network Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices - Optimal Blended Base Learner Neural Network Train Performance]
In [455]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_nn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8095

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.90      0.92        49
         1.0       0.77      0.85      0.81        20

    accuracy                           0.88        69
   macro avg       0.85      0.87      0.86        69
weighted avg       0.89      0.88      0.89        69

In [456]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Neural Network Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Neural Network Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices - Optimal Blended Base Learner Neural Network Validation Performance]
In [457]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_baselearner_nn_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train))
blended_baselearner_nn_optimal_train['model'] = ['blended_baselearner_nn_optimal'] * 5
blended_baselearner_nn_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Base Learner Neural Network Train Performance Metrics: ')
display(blended_baselearner_nn_optimal_train)
Optimal Blended Base Learner Neural Network Train Performance Metrics: 
  metric_name  metric_value                           model    set
0    Accuracy      0.892157  blended_baselearner_nn_optimal  train
1   Precision      0.809524  blended_baselearner_nn_optimal  train
2      Recall      0.836066  blended_baselearner_nn_optimal  train
3          F1      0.822581  blended_baselearner_nn_optimal  train
4       AUROC      0.876075  blended_baselearner_nn_optimal  train
In [458]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_baselearner_nn_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation))
blended_baselearner_nn_optimal_validation['model'] = ['blended_baselearner_nn_optimal'] * 5
blended_baselearner_nn_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Base Learner Neural Network Validation Performance Metrics: ')
display(blended_baselearner_nn_optimal_validation)
Optimal Blended Base Learner Neural Network Validation Performance Metrics: 
  metric_name  metric_value                           model         set
0    Accuracy      0.884058  blended_baselearner_nn_optimal  validation
1   Precision      0.772727  blended_baselearner_nn_optimal  validation
2      Recall      0.850000  blended_baselearner_nn_optimal  validation
3          F1      0.809524  blended_baselearner_nn_optimal  validation
4       AUROC      0.873980  blended_baselearner_nn_optimal  validation
In [459]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(blended_baselearner_nn_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_baselearner_neural_network_optimal.pkl"))
Out[459]:
['..\\models\\blended_model_baselearner_neural_network_optimal.pkl']

1.10.5 Base Learner - Decision Tree ¶

Decision Tree is a hierarchical classification model that recursively splits data based on feature values, forming a tree-like structure where each node represents a decision rule and each leaf represents a class label. The tree is built using a greedy algorithm that selects the best feature at each step based on criteria such as information gain or Gini impurity. The main advantages of decision trees include their interpretability, as the decision-making process can be easily visualized and understood, and their ability to model non-linear relationships without requiring extensive feature engineering. They also handle both numerical and categorical data well. However, decision trees are prone to overfitting, especially when deep trees are grown without pruning. Small changes in the dataset can lead to entirely different structures, making them unstable. Additionally, they tend to perform poorly on highly complex problems where relationships between variables are intricate, making ensemble methods such as Random Forest or Gradient Boosting more effective in practice.
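To illustrate the splitting criterion, the short sketch below computes the Gini impurity of a parent node and the weighted impurity of a candidate split; the parent counts mirror the 143/61 train class distribution, while the child counts are hypothetical.

##################################
# Minimal sketch (hypothetical child counts): Gini impurity of a
# node and the weighted impurity of a candidate split, as scored
# by the 'gini' criterion tuned in this section
##################################
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

parent = [143, 61]                  # class counts mirroring the train data
left, right = [120, 10], [23, 51]   # hypothetical child nodes after a split
n = sum(parent)
weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
print(round(gini(parent), 4), round(weighted, 4))  # 0.4192 vs 0.2459: impurity drops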

  1. The decision tree model from the sklearn.tree Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • criterion = function to measure the quality of a split made to vary between gini and entropy
    • max_depth = maximum depth of the tree made to vary between 3 and 6
    • min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
  3. A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
    • criterion = gini
    • max_depth = 6
    • min_samples_leaf = 5
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8970
    • Precision = 0.7500
    • Recall = 0.9836
    • F1 Score = 0.8510
    • AUROC = 0.9218
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.8550
    • Precision = 0.6666
    • Recall = 1.0000
    • F1 Score = 0.8000
    • AUROC = 0.8979
  7. The apparent and independent validation model performance measures were sufficiently comparable, which might be indicative of the absence of excessive model overfitting.
In [460]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [461]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_dt_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('blended_baselearner_dt_model', DecisionTreeClassifier(class_weight='balanced',
                                                            random_state=987654321))
])
In [462]:
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_dt_hyperparameter_grid = {
    'blended_baselearner_dt_model__criterion': ['gini', 'entropy'],
    'blended_baselearner_dt_model__max_depth': [3, 6],
    'blended_baselearner_dt_model__min_samples_leaf': [5, 10]
}
In [463]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [464]:
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_dt_grid_search = GridSearchCV(
    estimator=blended_baselearner_dt_pipeline,
    param_grid=blended_baselearner_dt_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [465]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [466]:
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_dt_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[466]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_dt_model',
                                        DecisionTreeClassifier(class_weight='balanced',
                                                               random_state=987654321))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_dt_model__criterion': ['gini',
                                                                     'entropy'],
                         'blended_baselearner_dt_model__max_depth': [3, 6],
                         'blended_baselearner_dt_model__min_samples_leaf': [5,
                                                                            10]},
             scoring='f1', verbose=1)
In [467]:
##################################
# Identifying the best model
##################################
blended_baselearner_dt_optimal = blended_baselearner_dt_grid_search.best_estimator_
In [468]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_dt_optimal_f1_cv = blended_baselearner_dt_grid_search.best_score_
blended_baselearner_dt_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train))
blended_baselearner_dt_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation))
In [469]:
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner Decision Trees: ')
print(f"Best Blended Base Learner Decision Trees Hyperparameters: {blended_baselearner_dt_grid_search.best_params_}")
Best Blended Base Learner Decision Trees: 
Best Blended Base Learner Decision Trees Hyperparameters: {'blended_baselearner_dt_model__criterion': 'gini', 'blended_baselearner_dt_model__max_depth': 6, 'blended_baselearner_dt_model__min_samples_leaf': 5}
In [470]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_dt_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_dt_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8099
F1 Score on Training Data: 0.8511

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.99      0.86      0.92       143
         1.0       0.75      0.98      0.85        61

    accuracy                           0.90       204
   macro avg       0.87      0.92      0.89       204
weighted avg       0.92      0.90      0.90       204

In [471]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Decision Trees Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Decision Trees Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices - Optimal Blended Base Learner Decision Trees Train Performance]
In [472]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_dt_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8000

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       1.00      0.80      0.89        49
         1.0       0.67      1.00      0.80        20

    accuracy                           0.86        69
   macro avg       0.83      0.90      0.84        69
weighted avg       0.90      0.86      0.86        69

In [473]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Decision Trees Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Decision Trees Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices - Optimal Blended Base Learner Decision Trees Validation Performance]
In [474]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_baselearner_dt_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train))
blended_baselearner_dt_optimal_train['model'] = ['blended_baselearner_dt_optimal'] * 5
blended_baselearner_dt_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Base Learner Decision Tree Train Performance Metrics: ')
display(blended_baselearner_dt_optimal_train)
Optimal Blended Base Learner Decision Tree Train Performance Metrics: 
  metric_name  metric_value                           model    set
0    Accuracy      0.897059  blended_baselearner_dt_optimal  train
1   Precision      0.750000  blended_baselearner_dt_optimal  train
2      Recall      0.983607  blended_baselearner_dt_optimal  train
3          F1      0.851064  blended_baselearner_dt_optimal  train
4       AUROC      0.921873  blended_baselearner_dt_optimal  train
In [475]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_baselearner_dt_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation))
blended_baselearner_dt_optimal_validation['model'] = ['blended_baselearner_dt_optimal'] * 5
blended_baselearner_dt_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Base Learner Decision Tree Validation Performance Metrics: ')
display(blended_baselearner_dt_optimal_validation)
Optimal Blended Base Learner Decision Tree Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.855072 blended_baselearner_dt_optimal validation
1 Precision 0.666667 blended_baselearner_dt_optimal validation
2 Recall 1.000000 blended_baselearner_dt_optimal validation
3 F1 0.800000 blended_baselearner_dt_optimal validation
4 AUROC 0.897959 blended_baselearner_dt_optimal validation
In [476]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(blended_baselearner_dt_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_baselearner_decision_trees_optimal.pkl"))
Out[476]:
['..\\models\\blended_model_baselearner_decision_trees_optimal.pkl']

1.10.6 Meta Learner - Logistic Regression ¶

Logistic Regression is a linear classification algorithm that estimates the probability of a binary outcome using the logistic (sigmoid) function. It assumes a linear relationship between the predictor variables and the log-odds of the target class. The algorithm involves calculating a weighted sum of input features, applying the sigmoid function to transform the result into a probability, and assigning a class label based on a threshold (typically 0.5). Logistic regression is simple, interpretable, and computationally efficient, making it a popular choice for baseline models and problems where relationships between features and the target variable are approximately linear. It also provides insight into feature importance through its learned coefficients. However, logistic regression has limitations: it struggles with non-linear relationships unless feature engineering or polynomial terms are used, it is sensitive to multicollinearity, and it assumes independence between predictor variables, which may not always hold in real-world data. Additionally, it may perform poorly when classes are highly imbalanced, requiring techniques such as weighting or resampling to improve predictions.
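
To make the mechanics described above concrete, the minimal sketch below computes a class probability from a weighted sum of inputs via the sigmoid function; the coefficients, intercept, and feature vector are hypothetical values chosen purely for illustration.

import numpy as np

##################################
# Minimal sketch of the logistic
# regression decision rule using
# hypothetical coefficients and inputs
##################################
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

coefficients = np.array([0.8, -1.2, 0.5])  # hypothetical learned weights
intercept = -0.3                           # hypothetical learned bias
x = np.array([1.0, 0.5, 2.0])              # hypothetical feature vector

log_odds = np.dot(coefficients, x) + intercept  # weighted sum of input features
probability = sigmoid(log_odds)                 # transform the log-odds into a probability
predicted_class = int(probability >= 0.5)       # assign a class label using a 0.5 threshold

print(f"Log-odds: {log_odds:.4f} | Probability: {probability:.4f} | Class: {predicted_class}")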

  1. The logistic regression model from the sklearn.linear_model Python library API was implemented.
  2. The model contains 3 fixed hyperparameters:
    • C = inverse of regularization strength held constant at a value of 1.0
    • penalty = penalty norm held constant at a value of l2
    • solver = algorithm used in the optimization problem held constant at a value of lbfgs
  3. A special hyperparameter (class_weight = balanced) was fixed to address the moderate 2:1 class imbalance observed between the No and Yes Recurred categories.
  4. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9068
    • Precision = 0.8000
    • Recall = 0.9180
    • F1 Score = 0.8549
    • AUROC = 0.9100
  5. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9275
    • Precision = 0.8260
    • Recall = 0.9500
    • F1 Score = 0.8837
    • AUROC = 0.9341
  6. The apparent and independent validation model performance measures were sufficiently comparable, suggesting the absence of excessive model overfitting.
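As a quick numeric check of this observation, the sketch below computes the optimism gap between the apparent and independent validation F1 scores reported above; the variable names are illustrative and the values are copied from the summary.

##################################
# Quick overfitting-optimism check
# using the F1 scores reported above
# (illustrative variable names)
##################################
apparent_f1 = 0.8549     # F1 score on the train data
validation_f1 = 0.8837   # F1 score on the validation data
optimism = apparent_f1 - validation_f1
print(f"Optimism (apparent - validation F1): {optimism:+.4f}")
# A small or negative gap is consistent with the absence of excessive overfitting.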
In [477]:
##################################
# Defining the blending strategy (75-25 development-holdout split)
##################################
X_preprocessed_train_development, X_preprocessed_holdout, y_preprocessed_train_development, y_preprocessed_holdout = train_test_split(
    X_preprocessed_train, y_preprocessed_train_encoded, 
    test_size=0.25, 
    random_state=987654321
)
In [478]:
##################################
# Loading the pre-trained base learners
# from the previously saved pickle files
##################################
blended_baselearners = {}
blended_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
for name in blended_baselearner_model:
    blended_baselearner_model_path = os.path.join("..", MODELS_PATH, f"blended_model_baselearner_{name}_optimal.pkl")
    blended_baselearners[name] = joblib.load(blended_baselearner_model_path)
    
In [479]:
##################################
# Initializing the meta-feature matrices
##################################
meta_train_blended = np.zeros((X_preprocessed_holdout.shape[0], len(blended_baselearners)))
meta_validation_blended = np.zeros((X_preprocessed_validation.shape[0], len(blended_baselearners)))
In [480]:
##################################
# Generating hold-out predictions for training the meta learner
##################################
for i, (name, model) in enumerate(blended_baselearners.items()):
    model.fit(X_preprocessed_train_development, y_preprocessed_train_development)  
    meta_train_blended[:, i] = model.predict_proba(X_preprocessed_holdout)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_holdout)
    meta_validation_blended[:, i] = model.predict_proba(X_preprocessed_validation)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_validation)
In [481]:
##################################
# Training the meta learner on the stacked features
##################################
blended_metalearner_lr_optimal = LogisticRegression(class_weight='balanced', 
                                                    penalty='l2', 
                                                    C=1.0, 
                                                    solver='lbfgs', 
                                                    random_state=987654321)
blended_metalearner_lr_optimal.fit(meta_train_blended, y_preprocessed_holdout)
Out[481]:
LogisticRegression(class_weight='balanced', random_state=987654321)
In [482]:
##################################
# Saving the meta learner model
# developed from the meta-train data
################################## 
joblib.dump(blended_metalearner_lr_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_metalearner_logistic_regression_optimal.pkl"))
Out[482]:
['..\\models\\blended_model_metalearner_logistic_regression_optimal.pkl']
In [483]:
##################################
# Creating a function to extract the 
# meta-feature matrix for new data
################################## 
def extract_blended_metafeature_matrix(X_preprocessed_new):
    ##################################
    # Loading the pre-trained base learners
    # from the previously saved pickle files
    ##################################
    blended_baselearners = {}
    blended_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
    for name in blended_baselearner_model:
        blended_baselearner_model_path = os.path.join("..", MODELS_PATH, f"blended_model_baselearner_{name}_optimal.pkl")
        blended_baselearners[name] = joblib.load(blended_baselearner_model_path)

    ##################################
    # Initializing the meta-feature matrix
    # for the new data
    ##################################
    meta_new_blended = np.zeros((X_preprocessed_new.shape[0], len(blended_baselearners)))

    ##################################
    # Refitting each base learner on the
    # development split (module-level variables)
    # to reproduce the state used when training
    # the meta learner, then generating the
    # meta-features for the new data
    ##################################
    for i, (name, model) in enumerate(blended_baselearners.items()):
        model.fit(X_preprocessed_train_development, y_preprocessed_train_development)
        meta_new_blended[:, i] = model.predict_proba(X_preprocessed_new)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_new)

    return meta_new_blended
In [484]:
##################################
# Evaluating the F1 scores
# on the training and validation data
##################################
blended_metalearner_lr_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train)))
blended_metalearner_lr_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation)))
In [485]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training data
# to assess overfitting optimism
##################################
print(f"F1 Score on Training Data: {blended_metalearner_lr_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train))))
F1 Score on Training Data: 0.8550

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.90      0.93       143
         1.0       0.80      0.92      0.85        61

    accuracy                           0.91       204
   macro avg       0.88      0.91      0.89       204
weighted avg       0.91      0.91      0.91       204

In [486]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train)))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices for the Optimal Blended Meta Learner Logistic Regression Train Performance]
In [487]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validationing Data: {blended_metalearner_lr_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation))))
F1 Score on Validation Data: 0.8837

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.98      0.92      0.95        49
         1.0       0.83      0.95      0.88        20

    accuracy                           0.93        69
   macro avg       0.90      0.93      0.92        69
weighted avg       0.93      0.93      0.93        69

In [488]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation)))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: Raw and Normalized Confusion Matrices for the Optimal Blended Meta Learner Logistic Regression Validation Performance]
In [489]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_metalearner_lr_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train)))
blended_metalearner_lr_optimal_train['model'] = ['blended_metalearner_lr_optimal'] * 5
blended_metalearner_lr_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Meta Learner Logistic Regression Train Performance Metrics: ')
display(blended_metalearner_lr_optimal_train)
Optimal Blended Meta Learner Logistic Regression Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.906863 blended_metalearner_lr_optimal train
1 Precision 0.800000 blended_metalearner_lr_optimal train
2 Recall 0.918033 blended_metalearner_lr_optimal train
3 F1 0.854962 blended_metalearner_lr_optimal train
4 AUROC 0.910065 blended_metalearner_lr_optimal train
In [490]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_metalearner_lr_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation)))
blended_metalearner_lr_optimal_validation['model'] = ['blended_metalearner_lr_optimal'] * 5
blended_metalearner_lr_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Meta Learner Logistic Regression Validation Performance Metrics: ')
display(blended_metalearner_lr_optimal_validation)
Optimal Blended Meta Learner Logistic Regression Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.927536 blended_metalearner_lr_optimal validation
1 Precision 0.826087 blended_metalearner_lr_optimal validation
2 Recall 0.950000 blended_metalearner_lr_optimal validation
3 F1 0.883721 blended_metalearner_lr_optimal validation
4 AUROC 0.934184 blended_metalearner_lr_optimal validation

1.11 Consolidated Findings ¶

  1. Among the 12 candidate models, the Blended Model developed by training a Meta Learner on the combined predictions of multiple Base Learners was selected as the final model, demonstrating the best F1 score on the independent validation data with minimal overfitting:
    • Apparent F1 Score Performance = 0.8549
    • Independent Validation F1 Score Performance = 0.8837
  2. The final model similarly demonstrated a consistently high F1 score on the independent test data:
    • Independent Test F1 Score Performance = 0.8571
  3. The final model configuration is described as follows (a minimal inference sketch is provided after this list):
    • Base Learner: k-nearest neighbors with optimal hyperparameters:
      • n_neighbors = 3
      • weights = uniform
      • metric = minkowski
    • Base Learner: support vector machine with optimal hyperparameters:
      • C = 1.0
      • kernel = linear
      • gamma = scale
    • Base Learner: ridge classifier with optimal hyperparameters:
      • alpha = 2.0
      • solver = saga
      • tol = 1e-4
    • Base Learner: neural network with optimal hyperparameters:
      • hidden_layer_sizes = (50,)
      • activation = relu
      • alpha = 0.0001
    • Base Learner: decision tree with optimal hyperparameters:
      • criterion = gini
      • max_depth = 6
      • min_samples_leaf = 5
    • Meta Learner: logistic regression model with optimal hyperparameters:
      • C = 1.0
      • penalty = l2
      • solver = lbfgs
  4. Only 2 of the 5 base learners demonstrated a significant contribution to the final prediction, as indicated by positive permutation-based importance values:
    • Base Learner: ridge classifier
    • Base Learner: support vector machine
  5. The remaining 3 base learners did not demonstrate a significant contribution to the final prediction, as indicated by negative permutation-based importance values:
    • Base Learner: decision tree
    • Base Learner: k-nearest neighbors
    • Base Learner: neural network
  6. For each of the significantly contributing base learners, the predictors with positive permutation-based importance are given as follows:
    • Base Learner: ridge classifier
      • Age
      • T
      • Focality
      • Smoking
      • Response
    • Base Learner: support vector machine
      • Age
      • T
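To make the final configuration actionable, the sketch below outlines how new, preprocessed observations could be scored with the final blended model by rebuilding the meta-feature matrix from the persisted base learners and passing it to the persisted meta learner; it assumes the extract_blended_metafeature_matrix helper and pickle file names defined earlier in this notebook, and X_preprocessed_new is a hypothetical input.

##################################
# Minimal inference sketch for the
# final blended model, assuming the
# helper function and saved pickle files
# defined earlier in this notebook
# (X_preprocessed_new is a hypothetical
# preprocessed feature matrix)
##################################
import os
import joblib

final_metalearner = joblib.load(
    os.path.join("..", MODELS_PATH, "blended_model_metalearner_logistic_regression_optimal.pkl"))

# Rebuild the meta-feature matrix from the persisted base learners
meta_features_new = extract_blended_metafeature_matrix(X_preprocessed_new)

# Predict the encoded class labels and the class probabilities
predicted_labels = final_metalearner.predict(meta_features_new)
predicted_probabilities = final_metalearner.predict_proba(meta_features_new)[:, 1]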
In [491]:
##################################
# Consolidating all the
# bagged, boosted, stacked and blended
# model performance measures
# for the train and validation data
##################################
ensemble_train_validation_all_performance = pd.concat([bagged_rf_optimal_train,
                                             bagged_rf_optimal_validation,
                                             bagged_et_optimal_train,
                                             bagged_et_optimal_validation,
                                             bagged_bdt_optimal_train,
                                             bagged_bdt_optimal_validation,
                                             bagged_blr_optimal_train,
                                             bagged_blr_optimal_validation,
                                             bagged_bsvm_optimal_train,
                                             bagged_bsvm_optimal_validation,
                                             boosted_ab_optimal_train,
                                             boosted_ab_optimal_validation,
                                             boosted_gb_optimal_train,
                                             boosted_gb_optimal_validation,
                                             boosted_xgb_optimal_train,
                                             boosted_xgb_optimal_validation,
                                             boosted_lgbm_optimal_train,
                                             boosted_lgbm_optimal_validation,
                                             boosted_cb_optimal_train,
                                             boosted_cb_optimal_validation,
                                             stacked_baselearner_knn_optimal_train, 
                                             stacked_baselearner_knn_optimal_validation,
                                             stacked_baselearner_svm_optimal_train, 
                                             stacked_baselearner_svm_optimal_validation,
                                             stacked_baselearner_rc_optimal_train, 
                                             stacked_baselearner_rc_optimal_validation,
                                             stacked_baselearner_nn_optimal_train, 
                                             stacked_baselearner_nn_optimal_validation,
                                             stacked_baselearner_dt_optimal_train, 
                                             stacked_baselearner_dt_optimal_validation,
                                             stacked_metalearner_lr_optimal_train, 
                                             stacked_metalearner_lr_optimal_validation,
                                             blended_baselearner_knn_optimal_train, 
                                             blended_baselearner_knn_optimal_validation,
                                             blended_baselearner_svm_optimal_train, 
                                             blended_baselearner_svm_optimal_validation,
                                             blended_baselearner_rc_optimal_train, 
                                             blended_baselearner_rc_optimal_validation,
                                             blended_baselearner_nn_optimal_train, 
                                             blended_baselearner_nn_optimal_validation,
                                             blended_baselearner_dt_optimal_train, 
                                             blended_baselearner_dt_optimal_validation,
                                             blended_metalearner_lr_optimal_train, 
                                             blended_metalearner_lr_optimal_validation], 
                                            ignore_index=True)
print('Consolidated Ensemble Model Performance on Train and Validation Data: ')
display(ensemble_train_validation_all_performance)
Consolidated Ensemble Model Performance on Train and Validation Data: 
metric_name metric_value model set
0 Accuracy 0.892157 bagged_rf_optimal train
1 Precision 0.774648 bagged_rf_optimal train
2 Recall 0.901639 bagged_rf_optimal train
3 F1 0.833333 bagged_rf_optimal train
4 AUROC 0.894876 bagged_rf_optimal train
... ... ... ... ...
215 Accuracy 0.927536 blended_metalearner_lr_optimal validation
216 Precision 0.826087 blended_metalearner_lr_optimal validation
217 Recall 0.950000 blended_metalearner_lr_optimal validation
218 F1 0.883721 blended_metalearner_lr_optimal validation
219 AUROC 0.934184 blended_metalearner_lr_optimal validation

220 rows × 4 columns

In [492]:
##################################
# Consolidating all the F1 score
# model performance measures
# between the train and validation data
##################################
ensemble_train_validation_all_performance_F1 = ensemble_train_validation_all_performance[ensemble_train_validation_all_performance['metric_name']=='F1']
ensemble_train_validation_all_performance_F1_train = ensemble_train_validation_all_performance_F1[ensemble_train_validation_all_performance_F1['set']=='train'].loc[:,"metric_value"]
ensemble_train_validation_all_performance_F1_validation = ensemble_train_validation_all_performance_F1[ensemble_train_validation_all_performance_F1['set']=='validation'].loc[:,"metric_value"]
In [493]:
##################################
# Combining all the F1 score
# model performance measures
# between the train and validation data
##################################
ensemble_train_validation_all_performance_F1_plot = pd.DataFrame({'train': ensemble_train_validation_all_performance_F1_train.values,
                                                              'validation': ensemble_train_validation_all_performance_F1_validation.values},
                                                             index=ensemble_train_validation_all_performance_F1['model'].unique())
ensemble_train_validation_all_performance_F1_plot
Out[493]:
train validation
bagged_rf_optimal 0.833333 0.837209
bagged_et_optimal 0.833333 0.837209
bagged_bdt_optimal 0.846154 0.857143
bagged_blr_optimal 0.833333 0.837209
bagged_bsvm_optimal 0.852713 0.857143
boosted_ab_optimal 0.843750 0.857143
boosted_gb_optimal 0.910569 0.829268
boosted_xgb_optimal 0.850394 0.857143
boosted_lgbm_optimal 0.894309 0.820513
boosted_cb_optimal 0.843750 0.857143
stacked_baselearner_knn_optimal 0.862069 0.648649
stacked_baselearner_svm_optimal 0.843750 0.857143
stacked_baselearner_rc_optimal 0.827068 0.837209
stacked_baselearner_nn_optimal 0.822581 0.809524
stacked_baselearner_dt_optimal 0.851064 0.800000
stacked_metalearner_lr_optimal 0.852713 0.857143
blended_baselearner_knn_optimal 0.862069 0.648649
blended_baselearner_svm_optimal 0.843750 0.857143
blended_baselearner_rc_optimal 0.827068 0.837209
blended_baselearner_nn_optimal 0.822581 0.809524
blended_baselearner_dt_optimal 0.851064 0.800000
blended_metalearner_lr_optimal 0.854962 0.883721
In [494]:
##################################
# Plotting all the F1 score
# model performance measures
# between the train and validation sets
##################################
ax = ensemble_train_validation_all_performance_F1_plot.plot.barh(figsize=(10, 20), width=0.9)
ax.set_xlim(0.00, 1.00)
ax.set_title("Model Comparison by F1 Score Performance on Train and Validation Data")
ax.set_xlabel("F1 Score Performance")
ax.set_ylabel("Ensemble Model")
ax.grid(False)
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in ax.containers:
    ax.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
[Figure: Model Comparison by F1 Score Performance on Train and Validation Data (All Bagged, Boosted, Stacked and Blended Models)]
In [495]:
##################################
# Consolidating all the final
# bagged, boosted, stacked and blended
# model performance measures
# for the train and validation data
##################################
ensemble_train_validation_performance = ensemble_train_validation_all_performance[
    ~ensemble_train_validation_all_performance['model'].str.contains('baselearner', case=False, na=False)
]
print('Consolidated Final Ensemble Model Performance on Train and Validation Data: ')
display(ensemble_train_validation_performance)
Consolidated Final Ensemble Model Performance on Train and Validation Data: 
metric_name metric_value model set
0 Accuracy 0.892157 bagged_rf_optimal train
1 Precision 0.774648 bagged_rf_optimal train
2 Recall 0.901639 bagged_rf_optimal train
3 F1 0.833333 bagged_rf_optimal train
4 AUROC 0.894876 bagged_rf_optimal train
... ... ... ... ...
215 Accuracy 0.927536 blended_metalearner_lr_optimal validation
216 Precision 0.826087 blended_metalearner_lr_optimal validation
217 Recall 0.950000 blended_metalearner_lr_optimal validation
218 F1 0.883721 blended_metalearner_lr_optimal validation
219 AUROC 0.934184 blended_metalearner_lr_optimal validation

120 rows × 4 columns

In [496]:
##################################
# Consolidating all the F1 score
# model performance measures
# between the train and validation data
##################################
ensemble_train_validation_performance_F1 = ensemble_train_validation_performance[ensemble_train_validation_performance['metric_name']=='F1']
ensemble_train_validation_performance_F1_train = ensemble_train_validation_performance_F1[ensemble_train_validation_performance_F1['set']=='train'].loc[:,"metric_value"]
ensemble_train_validation_performance_F1_validation = ensemble_train_validation_performance_F1[ensemble_train_validation_performance_F1['set']=='validation'].loc[:,"metric_value"]
In [497]:
##################################
# Combining all the F1 score
# model performance measures
# between the train and validation data
##################################
ensemble_train_validation_performance_F1_plot = pd.DataFrame({'train': ensemble_train_validation_performance_F1_train.values,
                                                              'validation': ensemble_train_validation_performance_F1_validation.values},
                                                             index=ensemble_train_validation_performance_F1['model'].unique())
ensemble_train_validation_performance_F1_plot
Out[497]:
train validation
bagged_rf_optimal 0.833333 0.837209
bagged_et_optimal 0.833333 0.837209
bagged_bdt_optimal 0.846154 0.857143
bagged_blr_optimal 0.833333 0.837209
bagged_bsvm_optimal 0.852713 0.857143
boosted_ab_optimal 0.843750 0.857143
boosted_gb_optimal 0.910569 0.829268
boosted_xgb_optimal 0.850394 0.857143
boosted_lgbm_optimal 0.894309 0.820513
boosted_cb_optimal 0.843750 0.857143
stacked_metalearner_lr_optimal 0.852713 0.857143
blended_metalearner_lr_optimal 0.854962 0.883721
In [498]:
##################################
# Plotting all the F1 score
# model performance measures
# between the train and validation sets
##################################
ax = ensemble_train_validation_performance_F1_plot.plot.barh(figsize=(10, 10), width=0.9)
ax.set_xlim(0.00, 1.00)
ax.set_title("Model Comparison by F1 Score Performance on Train and Validation Data")
ax.set_xlabel("F1 Score Performance")
ax.set_ylabel("Ensemble Model")
ax.grid(False)
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in ax.containers:
    ax.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
[Figure: Model Comparison by F1 Score Performance on Train and Validation Data (Final Ensemble Models)]
In [499]:
##################################
# Gathering all model performance measures
# for the validation data
##################################
ensemble_train_validation_performance_Accuracy_validation = ensemble_train_validation_performance[(ensemble_train_validation_performance['set']=='validation') & (ensemble_train_validation_performance['metric_name']=='Accuracy')].loc[:,"metric_value"]
ensemble_train_validation_performance_Precision_validation = ensemble_train_validation_performance[(ensemble_train_validation_performance['set']=='validation') & (ensemble_train_validation_performance['metric_name']=='Precision')].loc[:,"metric_value"]
ensemble_train_validation_performance_Recall_validation = ensemble_train_validation_performance[(ensemble_train_validation_performance['set']=='validation') & (ensemble_train_validation_performance['metric_name']=='Recall')].loc[:,"metric_value"]
ensemble_train_validation_performance_F1_validation = ensemble_train_validation_performance[(ensemble_train_validation_performance['set']=='validation') & (ensemble_train_validation_performance['metric_name']=='F1')].loc[:,"metric_value"]
ensemble_train_validation_performance_AUROC_validation = ensemble_train_validation_performance[(ensemble_train_validation_performance['set']=='validation') & (ensemble_train_validation_performance['metric_name']=='AUROC')].loc[:,"metric_value"]
In [500]:
##################################
# Combining all the model performance measures
# for the validation data
##################################
ensemble_train_validation_performance_all_plot_validation = pd.DataFrame({'accuracy': ensemble_train_validation_performance_Accuracy_validation.values,
                                                                    'precision': ensemble_train_validation_performance_Precision_validation.values,
                                                                    'recall': ensemble_train_validation_performance_Recall_validation.values,
                                                                    'f1': ensemble_train_validation_performance_F1_validation.values,
                                                                    'auroc': ensemble_train_validation_performance_AUROC_validation.values},
                                                                   index=ensemble_train_validation_performance['model'].unique())
ensemble_train_validation_performance_all_plot_validation
Out[500]:
accuracy precision recall f1 auroc
bagged_rf_optimal 0.898551 0.782609 0.90 0.837209 0.898980
bagged_et_optimal 0.898551 0.782609 0.90 0.837209 0.898980
bagged_bdt_optimal 0.913043 0.818182 0.90 0.857143 0.909184
bagged_blr_optimal 0.898551 0.782609 0.90 0.837209 0.898980
bagged_bsvm_optimal 0.913043 0.818182 0.90 0.857143 0.909184
boosted_ab_optimal 0.913043 0.818182 0.90 0.857143 0.909184
boosted_gb_optimal 0.898551 0.809524 0.85 0.829268 0.884184
boosted_xgb_optimal 0.913043 0.818182 0.90 0.857143 0.909184
boosted_lgbm_optimal 0.898551 0.842105 0.80 0.820513 0.869388
boosted_cb_optimal 0.913043 0.818182 0.90 0.857143 0.909184
stacked_metalearner_lr_optimal 0.913043 0.818182 0.90 0.857143 0.909184
blended_metalearner_lr_optimal 0.927536 0.826087 0.95 0.883721 0.934184
In [501]:
##################################
# Gathering the model evaluation metrics
# for the test data
##################################
##################################
# Defining a dictionary of models and 
# their corresponding feature extraction functions
##################################
models = {
    'bagged_rf_optimal': bagged_rf_optimal,
    'bagged_et_optimal': bagged_et_optimal,
    'bagged_bdt_optimal': bagged_bdt_optimal,
    'bagged_blr_optimal': bagged_blr_optimal,
    'bagged_bsvm_optimal': bagged_bsvm_optimal,
    'boosted_ab_optimal': boosted_ab_optimal,
    'boosted_gb_optimal': boosted_gb_optimal,
    'boosted_xgb_optimal': boosted_xgb_optimal,
    'boosted_lgbm_optimal': boosted_lgbm_optimal,
    'boosted_cb_optimal': boosted_cb_optimal,
    'stacked_baselearner_knn_optimal': stacked_baselearner_knn_optimal,
    'stacked_baselearner_svm_optimal': stacked_baselearner_svm_optimal,
    'stacked_baselearner_rc_optimal': stacked_baselearner_rc_optimal,
    'stacked_baselearner_nn_optimal': stacked_baselearner_nn_optimal,
    'stacked_baselearner_dt_optimal': stacked_baselearner_dt_optimal,
    'stacked_metalearner_lr_optimal': stacked_metalearner_lr_optimal,
    'blended_baselearner_knn_optimal': blended_baselearner_knn_optimal,
    'blended_baselearner_svm_optimal': blended_baselearner_svm_optimal,
    'blended_baselearner_rc_optimal': blended_baselearner_rc_optimal,
    'blended_baselearner_nn_optimal': blended_baselearner_nn_optimal,
    'blended_baselearner_dt_optimal': blended_baselearner_dt_optimal,
    'blended_metalearner_lr_optimal': blended_metalearner_lr_optimal
}

##################################
# Defining transformation functions for meta-learners
##################################
feature_extractors = {
    'stacked_metalearner_lr_optimal': extract_stacked_metafeature_matrix,
    'blended_metalearner_lr_optimal': extract_blended_metafeature_matrix
}
In [502]:
##################################
# Encoding the response variables
# for the test data
##################################
y_preprocessed_test_encoded = y_encoder.transform(y_preprocessed_test.values.reshape(-1, 1)).ravel()
In [503]:
##################################
# Storing the model evaluation metrics
# for the test data
##################################
ensemble_test_all_performance = []

##################################
# Looping through each model 
# and evaluate performance on test data
##################################
for model_name, model in models.items():
    # Applying transformation if needed (for meta-learner)
    X_input = feature_extractors.get(model_name, lambda x: x)(X_preprocessed_test)
    
    # Evaluating performance
    ensemble_test_all_performance_results = model_performance_evaluation(y_preprocessed_test_encoded, model.predict(X_input))
    
    # Adding metadata columns
    ensemble_test_all_performance_results['model'] = model_name
    ensemble_test_all_performance_results['set'] = 'test'
    
    # Storing result
    ensemble_test_all_performance.append(ensemble_test_all_performance_results)
In [504]:
##################################
# Consolidating all model performance measures
# for the test data
##################################
ensemble_test_all_performance = pd.concat(ensemble_test_all_performance, ignore_index=True)
print('Consolidated Ensemble Model Performance on Test Data: ')
display(ensemble_test_all_performance)
Consolidated Ensemble Model Performance on Test Data: 
metric_name metric_value model set
0 Accuracy 0.901099 bagged_rf_optimal test
1 Precision 0.821429 bagged_rf_optimal test
2 Recall 0.851852 bagged_rf_optimal test
3 F1 0.836364 bagged_rf_optimal test
4 AUROC 0.886863 bagged_rf_optimal test
... ... ... ... ...
105 Accuracy 0.912088 blended_metalearner_lr_optimal test
106 Precision 0.827586 blended_metalearner_lr_optimal test
107 Recall 0.888889 blended_metalearner_lr_optimal test
108 F1 0.857143 blended_metalearner_lr_optimal test
109 AUROC 0.905382 blended_metalearner_lr_optimal test

110 rows × 4 columns

In [505]:
##################################
# Consolidating all the final
# bagged, boosted, stacked and blended
# model performance measures
# for the test data
##################################
ensemble_test_performance = ensemble_test_all_performance[
    ~ensemble_test_all_performance['model'].str.contains('baselearner', case=False, na=False)
]
print('Consolidated Final Ensemble Model Performance on Test Data: ')
display(ensemble_test_performance)
Consolidated Final Ensemble Model Performance on Test Data: 
metric_name metric_value model set
0 Accuracy 0.901099 bagged_rf_optimal test
1 Precision 0.821429 bagged_rf_optimal test
2 Recall 0.851852 bagged_rf_optimal test
3 F1 0.836364 bagged_rf_optimal test
4 AUROC 0.886863 bagged_rf_optimal test
5 Accuracy 0.912088 bagged_et_optimal test
6 Precision 0.851852 bagged_et_optimal test
7 Recall 0.851852 bagged_et_optimal test
8 F1 0.851852 bagged_et_optimal test
9 AUROC 0.894676 bagged_et_optimal test
10 Accuracy 0.912088 bagged_bdt_optimal test
11 Precision 0.851852 bagged_bdt_optimal test
12 Recall 0.851852 bagged_bdt_optimal test
13 F1 0.851852 bagged_bdt_optimal test
14 AUROC 0.894676 bagged_bdt_optimal test
15 Accuracy 0.901099 bagged_blr_optimal test
16 Precision 0.800000 bagged_blr_optimal test
17 Recall 0.888889 bagged_blr_optimal test
18 F1 0.842105 bagged_blr_optimal test
19 AUROC 0.897569 bagged_blr_optimal test
20 Accuracy 0.912088 bagged_bsvm_optimal test
21 Precision 0.827586 bagged_bsvm_optimal test
22 Recall 0.888889 bagged_bsvm_optimal test
23 F1 0.857143 bagged_bsvm_optimal test
24 AUROC 0.905382 bagged_bsvm_optimal test
25 Accuracy 0.912088 boosted_ab_optimal test
26 Precision 0.851852 boosted_ab_optimal test
27 Recall 0.851852 boosted_ab_optimal test
28 F1 0.851852 boosted_ab_optimal test
29 AUROC 0.894676 boosted_ab_optimal test
30 Accuracy 0.923077 boosted_gb_optimal test
31 Precision 0.884615 boosted_gb_optimal test
32 Recall 0.851852 boosted_gb_optimal test
33 F1 0.867925 boosted_gb_optimal test
34 AUROC 0.902488 boosted_gb_optimal test
35 Accuracy 0.901099 boosted_xgb_optimal test
36 Precision 0.846154 boosted_xgb_optimal test
37 Recall 0.814815 boosted_xgb_optimal test
38 F1 0.830189 boosted_xgb_optimal test
39 AUROC 0.876157 boosted_xgb_optimal test
40 Accuracy 0.912088 boosted_lgbm_optimal test
41 Precision 0.880000 boosted_lgbm_optimal test
42 Recall 0.814815 boosted_lgbm_optimal test
43 F1 0.846154 boosted_lgbm_optimal test
44 AUROC 0.883970 boosted_lgbm_optimal test
45 Accuracy 0.912088 boosted_cb_optimal test
46 Precision 0.851852 boosted_cb_optimal test
47 Recall 0.851852 boosted_cb_optimal test
48 F1 0.851852 boosted_cb_optimal test
49 AUROC 0.894676 boosted_cb_optimal test
75 Accuracy 0.923077 stacked_metalearner_lr_optimal test
76 Precision 0.857143 stacked_metalearner_lr_optimal test
77 Recall 0.888889 stacked_metalearner_lr_optimal test
78 F1 0.872727 stacked_metalearner_lr_optimal test
79 AUROC 0.913194 stacked_metalearner_lr_optimal test
105 Accuracy 0.912088 blended_metalearner_lr_optimal test
106 Precision 0.827586 blended_metalearner_lr_optimal test
107 Recall 0.888889 blended_metalearner_lr_optimal test
108 F1 0.857143 blended_metalearner_lr_optimal test
109 AUROC 0.905382 blended_metalearner_lr_optimal test
In [506]:
##################################
# Gathering all model performance measures
# for the test data
##################################
ensemble_test_performance_Accuracy_test = ensemble_test_performance[(ensemble_test_performance['set']=='test') & (ensemble_test_performance['metric_name']=='Accuracy')].loc[:,"metric_value"]
ensemble_test_performance_Precision_test = ensemble_test_performance[(ensemble_test_performance['set']=='test') & (ensemble_test_performance['metric_name']=='Precision')].loc[:,"metric_value"]
ensemble_test_performance_Recall_test = ensemble_test_performance[(ensemble_test_performance['set']=='test') & (ensemble_test_performance['metric_name']=='Recall')].loc[:,"metric_value"]
ensemble_test_performance_F1_test = ensemble_test_performance[(ensemble_test_performance['set']=='test') & (ensemble_test_performance['metric_name']=='F1')].loc[:,"metric_value"]
ensemble_test_performance_AUROC_test = ensemble_test_performance[(ensemble_test_performance['set']=='test') & (ensemble_test_performance['metric_name']=='AUROC')].loc[:,"metric_value"]
In [507]:
##################################
# Combining all the model performance measures
# for the test data
##################################
ensemble_test_performance_all_plot_test = pd.DataFrame({'accuracy': ensemble_test_performance_Accuracy_test.values,
                                                                    'precision': ensemble_test_performance_Precision_test.values,
                                                                    'recall': ensemble_test_performance_Recall_test.values,
                                                                    'f1': ensemble_test_performance_F1_test.values,
                                                                    'auroc': ensemble_test_performance_AUROC_test.values},
                                                                   index=ensemble_test_performance['model'].unique())
ensemble_test_performance_all_plot_test
Out[507]:
accuracy precision recall f1 auroc
bagged_rf_optimal 0.901099 0.821429 0.851852 0.836364 0.886863
bagged_et_optimal 0.912088 0.851852 0.851852 0.851852 0.894676
bagged_bdt_optimal 0.912088 0.851852 0.851852 0.851852 0.894676
bagged_blr_optimal 0.901099 0.800000 0.888889 0.842105 0.897569
bagged_bsvm_optimal 0.912088 0.827586 0.888889 0.857143 0.905382
boosted_ab_optimal 0.912088 0.851852 0.851852 0.851852 0.894676
boosted_gb_optimal 0.923077 0.884615 0.851852 0.867925 0.902488
boosted_xgb_optimal 0.901099 0.846154 0.814815 0.830189 0.876157
boosted_lgbm_optimal 0.912088 0.880000 0.814815 0.846154 0.883970
boosted_cb_optimal 0.912088 0.851852 0.851852 0.851852 0.894676
stacked_metalearner_lr_optimal 0.923077 0.857143 0.888889 0.872727 0.913194
blended_metalearner_lr_optimal 0.912088 0.827586 0.888889 0.857143 0.905382
In [508]:
##################################
# Consolidating all the final
# bagged, boosted, stacked and blended
# model performance measures
# for the train, validation and test data
##################################
ensemble_overall_performance = pd.concat([ensemble_train_validation_performance, ensemble_test_performance], axis=0)
In [509]:
##################################
# Consolidating all the F1 score
# model performance measures
# between the train, validation and test data
##################################
ensemble_overall_performance_F1 = ensemble_overall_performance[ensemble_overall_performance['metric_name']=='F1']
ensemble_overall_performance_F1_train = ensemble_overall_performance_F1[ensemble_overall_performance_F1['set']=='train'].loc[:,"metric_value"]
ensemble_overall_performance_F1_validation = ensemble_overall_performance_F1[ensemble_overall_performance_F1['set']=='validation'].loc[:,"metric_value"]
ensemble_overall_performance_F1_test = ensemble_overall_performance_F1[ensemble_overall_performance_F1['set']=='test'].loc[:,"metric_value"]
In [510]:
##################################
# Combining all the F1 score
# model performance measures
# between the train, validation and test data
##################################
ensemble_overall_performance_F1_plot = pd.DataFrame({'train': ensemble_overall_performance_F1_train.values,
                                                     'validation': ensemble_overall_performance_F1_validation.values,
                                                     'test': ensemble_overall_performance_F1_test.values},
                                                    index=ensemble_overall_performance_F1['model'].unique())
ensemble_overall_performance_F1_plot
Out[510]:
train validation test
bagged_rf_optimal 0.833333 0.837209 0.836364
bagged_et_optimal 0.833333 0.837209 0.851852
bagged_bdt_optimal 0.846154 0.857143 0.851852
bagged_blr_optimal 0.833333 0.837209 0.842105
bagged_bsvm_optimal 0.852713 0.857143 0.857143
boosted_ab_optimal 0.843750 0.857143 0.851852
boosted_gb_optimal 0.910569 0.829268 0.867925
boosted_xgb_optimal 0.850394 0.857143 0.830189
boosted_lgbm_optimal 0.894309 0.820513 0.846154
boosted_cb_optimal 0.843750 0.857143 0.851852
stacked_metalearner_lr_optimal 0.852713 0.857143 0.872727
blended_metalearner_lr_optimal 0.854962 0.883721 0.857143
In [511]:
##################################
# Plotting all the F1 score
# model performance measures
# between train, validation and test sets
##################################
ax = ensemble_overall_performance_F1_plot.plot.barh(figsize=(10, 10), width=0.9)
ax.set_xlim(0.00, 1.00)
ax.set_title("Model Comparison by F1 Score Performance on Train, Validation and Test Data")
ax.set_xlabel("F1 Score Performance")
ax.set_ylabel("Ensemble Model")
ax.grid(False)
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in ax.containers:
    ax.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
[Figure: Model Comparison by F1 Score Performance on Train, Validation and Test Data]
In [512]:
##################################
# Computing the permutation importance
# for the final model determined as the blended model
# with a Logistic Regression meta learner comprised of the 
# KNN, SVM, Ridge Classifier, Neural Network and Decision Tree base learners
##################################
base_learner_names = ['KNN', 'SVM', 'Ridge Classifier', 'Neural Network', 'Decision Tree']
perm_importance = permutation_importance(
    blended_metalearner_lr_optimal,  # Meta Learner
    meta_validation_blended,         # Meta Features (Base Learner Predictions)
    y_preprocessed_validation_encoded,       # True Labels
    n_repeats=10, 
    random_state=42
)

# Obtaining the sorted indices in descending order
sorted_idx = perm_importance.importances_mean.argsort()[::-1]

# Plotting the feature importance
plt.figure(figsize=(17, 5))
plt.bar(range(len(perm_importance.importances_mean)), perm_importance.importances_mean[sorted_idx], align='center')
plt.xticks(range(len(perm_importance.importances_mean)), np.array(base_learner_names)[sorted_idx], rotation=90)
plt.xlabel("Base Learner")
plt.ylabel("Permutation Importance Score")
plt.title("Permutation Importance: Blended Model (Meta Learner: Logistic Regression, Base Learners: KNN, SVM, Ridge Classifier, Neural Network, Decision Tree)")
plt.show()
[Figure: Permutation Importance of the KNN, SVM, Ridge Classifier, Neural Network and Decision Tree Base Learners in the Blended Model with a Logistic Regression Meta Learner]
In [513]:
##################################
# Creating a function to compute the permutation importance
# for the KNN, SVM, Ridge Classifier, Neural Network and Decision Tree base learners
##################################
feature_names = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response','Age']
def compute_permutation_importance(model, X_evaluation, y_evaluation, model_name="Model", feature_names=feature_names, n_repeats=10, random_state=42):
    # Computing permutation importance
    perm_importance = permutation_importance(model, X_evaluation, y_evaluation, n_repeats=n_repeats, random_state=random_state)

    # Getting the sorted indices (descending order)
    sorted_idx = perm_importance.importances_mean.argsort()[::-1]

    # Using feature names if provided, else using column indices
    if feature_names is None:
        feature_names = [f"Feature {i}" for i in range(X_evaluation.shape[1])]

    # Plotting feature importance
    plt.figure(figsize=(17, 5))
    plt.bar(range(len(perm_importance.importances_mean)), perm_importance.importances_mean[sorted_idx], align='center')
    plt.xticks(range(len(perm_importance.importances_mean)), np.array(feature_names)[sorted_idx], rotation=90)
    plt.xlabel("Feature")
    plt.ylabel("Permutation Importance Score")
    plt.title(f"Feature Importance (Permutation): {model_name}")
    plt.show()

    return perm_importance
In [514]:
##################################
# Computing the permutation importance
# for the Ridge Classifier base learner
##################################
perm_importance_blended_baselearner_rc_optimal = compute_permutation_importance(blended_baselearner_rc_optimal, 
                                                                                X_preprocessed_train, 
                                                                                y_preprocessed_train_encoded, 
                                                                                "Optimal Blended Base Learner Ridge Classifier",
                                                                                feature_names=feature_names)
[Figure: Feature Importance (Permutation): Optimal Blended Base Learner Ridge Classifier]
In [515]:
##################################
# Computing the permutation importance
# for the Support Vector Machine base learner
##################################
perm_importance_blended_baselearner_svm_optimal = compute_permutation_importance(blended_baselearner_svm_optimal, 
                                                                                 X_preprocessed_train, 
                                                                                 y_preprocessed_train_encoded, 
                                                                                 "Optimal Blended Base Learner SVM",
                                                                                 feature_names=feature_names)
[Figure: Feature Importance (Permutation): Optimal Blended Base Learner SVM]
In [516]:
##################################
# Computing the permutation importance
# for the Decision Tree base learner
##################################
perm_importance_blended_baselearner_dt_optimal = compute_permutation_importance(blended_baselearner_dt_optimal, 
                                                                                X_preprocessed_train, 
                                                                                y_preprocessed_train_encoded, 
                                                                                "Optimal Blended Base Learner Decision Tree",
                                                                                feature_names=feature_names)
[Figure: Feature Importance (Permutation): Optimal Blended Base Learner Decision Tree]
In [517]:
##################################
# Computing the permutation importance
# for the KNN base learner
##################################
perm_importance_blended_baselearner_knn_optimal = compute_permutation_importance(blended_baselearner_knn_optimal, 
                                                                                 X_preprocessed_train, 
                                                                                 y_preprocessed_train_encoded, 
                                                                                 "Optimal Blended Base Learner KNN",
                                                                                 feature_names=feature_names)
[Figure: Feature Importance (Permutation): Optimal Blended Base Learner KNN]
In [518]:
##################################
# Computing the permutation importance
# for the Neural Network base learner
##################################
perm_importance_blended_baselearner_nn_optimal = compute_permutation_importance(blended_baselearner_nn_optimal, 
                                                                                X_preprocessed_train, 
                                                                                y_preprocessed_train_encoded, 
                                                                                "Optimal Blended Base Learner Neural Network",
                                                                                feature_names=feature_names)
[Figure: Feature Importance (Permutation): Optimal Blended Base Learner Neural Network]

2. Summary ¶

[Figure: Project59_Summary.png (consolidated project summary)]

3. References ¶

  • [Book] Ensemble Methods for Machine Learning by Gautam Kunapuli
  • [Book] Applied Predictive Modeling by Max Kuhn and Kjell Johnson
  • [Book] An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani
  • [Book] Ensemble Methods: Foundations and Algorithms by Zhi-Hua Zhou
  • [Book] Effective XGBoost: Optimizing, Tuning, Understanding, and Deploying Classification Models (Treading on Python) by Matt Harrison, Edward Krueger, Alex Rook, Ronald Legere and Bojan Tunguz
  • [Python Library API] NumPy by NumPy Team
  • [Python Library API] pandas by Pandas Team
  • [Python Library API] seaborn by Seaborn Team
  • [Python Library API] matplotlib.pyplot by MatPlotLib Team
  • [Python Library API] matplotlib.image by MatPlotLib Team
  • [Python Library API] matplotlib.offsetbox by MatPlotLib Team
  • [Python Library API] itertools by Python Team
  • [Python Library API] operator by Python Team
  • [Python Library API] sklearn.experimental by Scikit-Learn Team
  • [Python Library API] sklearn.impute by Scikit-Learn Team
  • [Python Library API] sklearn.linear_model by Scikit-Learn Team
  • [Python Library API] sklearn.preprocessing by Scikit-Learn Team
  • [Python Library API] scipy by SciPy Team
  • [Python Library API] sklearn.tree by Scikit-Learn Team
  • [Python Library API] sklearn.ensemble by Scikit-Learn Team
  • [Python Library API] sklearn.svm by Scikit-Learn Team
  • [Python Library API] sklearn.metrics by Scikit-Learn Team
  • [Python Library API] sklearn.neighbors by Scikit-Learn Team
  • [Python Library API] sklearn.neural_network by Scikit-Learn Team
  • [Python Library API] xgboost by XGBoost Team
  • [Python Library API] lightgbm by LightGBM Team
  • [Python Library API] catboost by CatBoost Team
  • [Python Library API] imblearn.over_sampling by Imbalanced-Learn Team
  • [Python Library API] imblearn.under_sampling by Imbalanced-Learn Team
  • [Python Library API] StatsModels by StatsModels Team
  • [Python Library API] SciPy by SciPy Team
  • [Article] Ensemble: Boosting, Bagging, and Stacking Machine Learning by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] Stacking Machine Learning: Everything You Need to Know by Ada Parker (MachineLearningPro.Org)
  • [Article] Ensemble Learning: Bagging, Boosting and Stacking by Edouard Duchesnay, Tommy Lofstedt and Feki Younes (Duchesnay.GitHub.IO)
  • [Article] Stack Machine Learning Models: Get Better Results by Casper Hansen (Developer.IBM.Com)
  • [Article] GradientBoosting vs AdaBoost vs XGBoost vs CatBoost vs LightGBM by Geeks for Geeks Team (GeeksForGeeks.Org)
  • [Article] A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] The Ultimate Guide to AdaBoost Algorithm | What is AdaBoost Algorithm? by Ashish Kumar (MyGreatLearning.Com)
  • [Article] A Gentle Introduction to Ensemble Learning Algorithms by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results by Necati Demir (Toptal.Com)
  • [Article] The Essential Guide to Ensemble Learning by Rohit Kundu (V7Labs.Com)
  • [Article] Develop an Intuition for How Ensemble Learning Works by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] Mastering Ensemble Techniques in Machine Learning: Bagging, Boosting, Bayes Optimal Classifier, and Stacking by Rahul Jain (Medium)
  • [Article] Ensemble Learning: Bagging, Boosting, Stacking by Ayşe Kübra Kuyucu (Medium)
  • [Article] Ensemble: Boosting, Bagging, and Stacking Machine Learning by Aleyna Şenozan (Medium)
  • [Article] Boosting, Stacking, and Bagging for Ensemble Models for Time Series Analysis with Python by Kyle Jones (Medium)
  • [Article] Different types of Ensemble Techniques — Bagging, Boosting, Stacking, Voting, Blending by Abhishek Jain (Medium)
  • [Article] Understanding Ensemble Methods: Bagging, Boosting, and Stacking by Divya bhagat (Medium)
  • [Video Tutorial] BAGGING vs. BOOSTING vs STACKING in Ensemble Learning | Machine Learning by Gate Smashers (YouTube)
  • [Video Tutorial] What is Ensemble Method in Machine Learning | Bagging | Boosting | Stacking | Voting by Data_SPILL (YouTube)
  • [Video Tutorial] Ensemble Methods | Bagging | Boosting | Stacking by World of Signet (YouTube)
  • [Video Tutorial] Ensemble (Boosting, Bagging, and Stacking) in Machine Learning: Easy Explanation for Data Scientists by Emma Ding (YouTube)
  • [Video Tutorial] Ensemble Learning - Bagging, Boosting, and Stacking explained in 4 minutes! by Melissa Van Bussel (YouTube)
  • [Video Tutorial] Introduction to Ensemble Learning | Bagging , Boosting & Stacking Techniques by UncomplicatingTech (YouTube)
  • [Video Tutorial] Machine Learning Basics: Ensemble Learning: Bagging, Boosting, Stacking by ISSAI_NU (YouTube)
  • [Course] DataCamp Python Data Analyst Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Python Associate Data Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Python Data Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Machine Learning Engineer Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Machine Learning Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] IBM Data Analyst Professional Certificate by IBM Team (Coursera)
  • [Course] IBM Data Science Professional Certificate by IBM Team (Coursera)
  • [Course] IBM Machine Learning Professional Certificate by IBM Team (Coursera)
In [519]:
from IPython.display import display, HTML
display(HTML("<style>.rendered_html { font-size: 15px; font-family: 'Trebuchet MS'; }</style>"))