Supervised Learning: Leveraging Ensemble Learning With Bagging, Boosting, Stacking and Blending Approaches¶
- 1. Table of Contents
- 2. Summary
- 3. References
1. Table of Contents ¶
This project explores different Ensemble Learning approaches which combine the predictions from multiple models in an effort to achieve better predictive performance using various helpful packages in Python. The ensemble frameworks applied in the analysis were grouped into three classes: the Bagging Approach, which fits many individual learners on different samples of the same dataset and averages the predictions; the Boosting Approach, which adds ensemble members sequentially that correct the predictions made by prior models and outputs a weighted average of the predictions; and the Stacking or Blending Approach, which consolidates many different and diverse learners on the same data and uses another model to learn how to best combine the predictions. Bagged models applied were the Random Forest, Extra Trees, Bagged Decision Tree, Bagged Logistic Regression and Bagged Support Vector Machine algorithms. Boosting models included the AdaBoost, Stochastic Gradient Boosting, Extreme Gradient Boosting, Light Gradient Boosting Machines and CatBoost algorithms. Individual base learners including the K-Nearest Neighbors, Support Vector Machine, Ridge Classifier, Neural Network and Decision Tree algorithms were stacked or blended together as contributors to the Logistic Regression meta-model. The resulting predictions derived from all ensemble learning models were independently evaluated on a test set based on accuracy and F1 score metrics. All results were consolidated in a Summary presented at the end of the document.
Ensemble Learning is a machine learning technique that improves predictive accuracy by combining multiple models to leverage their collective strengths. Traditional machine learning models often struggle with either high bias, which leads to overly simplistic predictions, or high variance, which makes them too sensitive to fluctuations in the data. Ensemble learning addresses these challenges by aggregating the outputs of several models, creating a more robust and reliable predictor. In classification problems, this can be done through majority voting, weighted averaging, or more advanced meta-learning techniques. The key advantage of ensemble learning is its ability to reduce both bias and variance, leading to better generalization on unseen data. However, this comes at the cost of increased computational complexity and interpretability, as managing multiple models requires more resources and makes it harder to explain predictions.
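The sketch below is a minimal illustration (not part of this analysis) of how such an aggregation can be expressed in Python: three hypothetical base classifiers are combined through hard majority voting on synthetic data, with all settings chosen purely for demonstration.
##################################
# Illustrative sketch only:
# hard majority voting over three base classifiers
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption, for illustration only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Each base model casts a vote; the majority class becomes the ensemble prediction
voting_ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('knn', KNeighborsClassifier()),
                ('dt', DecisionTreeClassifier(random_state=0))],
    voting='hard')
voting_ensemble.fit(X_tr, y_tr)
print(f"Holdout accuracy: {voting_ensemble.score(X_te, y_te):.3f}")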
Bagging (Bootstrap Aggregating) is an ensemble learning technique that reduces model variance by training multiple instances of the same algorithm on different randomly sampled subsets of the training data. The fundamental problem bagging aims to solve is overfitting, particularly in high-variance models. By generating multiple bootstrap samples—random subsets created through sampling with replacement — bagging ensures that each model is trained on slightly different data, making the overall prediction more stable. In classification problems, the final output is obtained by majority voting among the individual models, while in regression, their predictions are averaged. Bagging is particularly effective when dealing with noisy datasets, as it smooths out individual model errors. However, its effectiveness is limited for low-variance models, and the requirement to train multiple models increases computational cost.
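As a minimal sketch of the bagging mechanics (illustrative only; the base learner and settings are assumptions rather than the tuned configurations used later in this notebook), a bagged decision tree could be declared as follows:
##################################
# Illustrative sketch only:
# bagging a decision tree on bootstrap samples
##################################
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each of the 50 trees is fit on a bootstrap sample (sampling with replacement);
# predictions are combined by majority voting when predict() is called
bagged_tree = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=0),  # use base_estimator= on scikit-learn < 1.2
    n_estimators=50,
    bootstrap=True,
    random_state=0)
# bagged_tree.fit(X_train, y_train) would then smooth out individual tree errors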
Boosting is an ensemble learning method that builds a strong classifier by training models sequentially, where each new model focuses on correcting the mistakes of its predecessors. Boosting assigns higher weights to misclassified instances, ensuring that subsequent models pay more attention to these hard-to-classify cases. The motivation behind boosting is to reduce both bias and variance by iteratively refining weak learners — models that perform only slightly better than random guessing — until they collectively form a strong classifier. In classification tasks, predictions are refined by combining weighted outputs of multiple weak models, typically decision stumps or shallow trees. This makes boosting highly effective in uncovering complex patterns in data. However, the sequential nature of boosting makes it computationally expensive compared to bagging, and it is more prone to overfitting if the number of weak learners is too high.
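A minimal sketch of this sequential correction (illustrative only; the stump-based weak learner and settings are assumptions) using AdaBoost would look like:
##################################
# Illustrative sketch only:
# AdaBoost with decision stumps as weak learners
##################################
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Each stage upweights the cases misclassified by the previous stages;
# the final prediction is a weighted vote across all stumps
boosted_stumps = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stump; use base_estimator= on scikit-learn < 1.2
    n_estimators=100,     # number of sequentially added weak learners
    learning_rate=0.1,    # shrinks each learner's contribution to temper overfitting
    random_state=0)
# boosted_stumps.fit(X_train, y_train) builds the ensemble stage by stage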
Stacking, or stacked generalization, is an advanced ensemble method that improves predictive performance by training a meta-model to learn the optimal way to combine multiple base models using their out-of-fold predictions. Unlike traditional ensemble techniques such as bagging and boosting, which aggregate predictions through simple rules like averaging or majority voting, stacking introduces a second-level model that intelligently learns how to integrate diverse base models. The process starts by training multiple classifiers on the training dataset. However, instead of directly using their predictions, stacking employs k-fold cross-validation to generate out-of-fold predictions. Specifically, each base model is trained on a subset of the training data while leaving out a validation fold, and predictions on that unseen fold are recorded. This process is repeated across all folds, ensuring that each instance in the training data receives predictions from models that never saw it during training. These out-of-fold predictions are then used as input features for a meta-model, which learns the best way to combine them into a final decision. The advantage of stacking is that it allows different models to complement each other, capturing diverse aspects of the data that a single model might miss. This often results in superior classification accuracy compared to individual models or simpler ensemble approaches. However, stacking is computationally expensive, requiring multiple training iterations for base models and the additional meta-model. It also demands careful tuning to prevent overfitting, as the meta-model’s complexity can introduce new sources of error. Despite these challenges, stacking remains a powerful technique in applications where maximizing predictive performance is a priority.
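The minimal sketch below (illustrative only; the base learners and settings are assumptions) shows how scikit-learn's StackingClassifier expresses this scheme, with the cv argument controlling the internal cross-validation that generates the out-of-fold predictions for the meta-model:
##################################
# Illustrative sketch only:
# stacking diverse base learners under a logistic regression meta-model
##################################
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

stacked_model = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier()),
                ('svm', SVC(random_state=0)),
                ('dt', DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-model combining the base outputs
    cv=5,                  # 5-fold CV produces the out-of-fold predictions
    stack_method='auto')   # uses predict_proba or decision_function when available
# stacked_model.fit(X_train, y_train) trains the meta-model on out-of-fold predictions,
# then refits the base learners on the full training data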
Blending is an ensemble technique that enhances classification accuracy by training a meta-model on a holdout validation set, rather than using out-of-fold predictions like stacking. This simplifies implementation while maintaining the benefits of combining multiple base models. The process of blending starts by reserving a small portion of the training data as a holdout set and training the base models on the remaining portion, instead of applying cross-validation to obtain out-of-fold predictions. The base models make predictions on this unseen holdout set, and these predictions are then used as input features for a meta-model, which learns how to optimally combine them into a final classification decision. Since the meta-model is trained on predictions from unseen data, it avoids the risk of overfitting that can sometimes occur when base models are evaluated on the same data they were trained on. Blending is motivated by its simplicity and ease of implementation compared to stacking, as it eliminates the need for repeated k-fold cross-validation to generate training data for the meta-model. However, one drawback is that the meta-model has access to fewer training examples, as a portion of the data is withheld as the holdout set rather than being used for training. This can limit the generalization ability of the final model, especially if the holdout set is too small. Despite this limitation, blending remains a useful approach in applications where a quick and effective ensemble method is needed without the computational overhead of stacking.
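Since scikit-learn does not ship a dedicated blending estimator, the minimal hand-rolled sketch below (illustrative only; the function name, base learners and holdout fraction are assumptions) outlines the holdout scheme described above:
##################################
# Illustrative sketch only:
# blending with a holdout set feeding a logistic regression meta-model
##################################
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def blend_fit_predict(X_train, y_train, X_test, holdout_size=0.25):
    # Reserve a holdout portion of the training data for the meta-model
    X_base, X_hold, y_base, y_hold = train_test_split(
        X_train, y_train, test_size=holdout_size, stratify=y_train, random_state=0)
    base_models = [KNeighborsClassifier(), DecisionTreeClassifier(random_state=0)]
    for model in base_models:
        model.fit(X_base, y_base)  # base models never see the holdout rows
    # Base-model probabilities on the holdout become the meta-model's input features
    meta_features = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in base_models])
    meta_model = LogisticRegression(max_iter=1000).fit(meta_features, y_hold)
    # At prediction time, the same base models generate features for the meta-model
    test_features = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])
    return meta_model.predict(test_features)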
1.1. Data Background ¶
An open Thyroid Disease Dataset from Kaggle (with all credits attributed to Jai Naru and Abuchi Onwuegbusi) was used for the analysis as consolidated from the following primary sources:
- Reference Repository entitled Differentiated Thyroid Cancer Recurrence from UC Irvine Machine Learning Repository
- Research Paper entitled Machine Learning for Risk Stratification of Thyroid Cancer Patients: a 15-year Cohort Study from the European Archives of Oto-Rhino-Laryngology
This study hypothesized that various clinicopathological characteristics influence differentiated thyroid cancer recurrence among patients.
The dichotomous categorical variable for the study is:
- Recurred - Status of the patient (Yes, Recurrence of differentiated thyroid cancer | No, No recurrence of differentiated thyroid cancer)
The predictor variables for the study are:
- Age - Patient's age (Years)
- Gender - Patient's sex (M | F)
- Smoking - Indication of smoking (Yes | No)
- Hx Smoking - Indication of smoking history (Yes | No)
- Hx Radiotherapy - Indication of radiotherapy history for any condition (Yes | No)
- Thyroid Function - Status of thyroid function (Clinical Hyperthyroidism, Hypothyroidism | Subclinical Hyperthyroidism, Hypothyroidism | Euthyroid)
- Physical Examination - Findings from physical examination including palpation of the thyroid gland and surrounding structures (Normal | Diffuse Goiter | Multinodular Goiter | Single Nodular Goiter Left, Right)
- Adenopathy - Indication of enlarged lymph nodes in the neck region (No | Right | Extensive | Left | Bilateral | Posterior)
- Pathology - Specific thyroid cancer type as determined by pathology examination of biopsy samples (Follicular | Hurthle Cell | Micropapillary | Papillary)
- Focality - Indication if the cancer is limited to one location or present in multiple locations (Uni-Focal | Multi-Focal)
- Risk - Risk category of the cancer based on various factors, such as tumor size, extent of spread, and histological type (Low | Intermediate | High)
- T - Tumor classification based on its size and extent of invasion into nearby structures (T1a | T1b | T2 | T3a | T3b | T4a | T4b)
- N - Nodal classification indicating the involvement of lymph nodes (N0 | N1a | N1b)
- M - Metastasis classification indicating the presence or absence of distant metastases (M0 | M1)
- Stage - Overall stage of the cancer, typically determined by combining T, N, and M classifications (I | II | III | IVA | IVB)
- Response - Cancer's response to treatment (Biochemical Incomplete | Indeterminate | Excellent | Structural Incomplete)
1.2. Data Description ¶
- The initial tabular dataset was comprised of 383 observations and 17 variables (including 1 target and 16 predictors).
- 383 rows (observations)
- 17 columns (variables)
- 1/17 target (categorical)
- Recurred
- 1/17 predictor (numeric)
- Age
- 15/17 predictor (categorical)
- Gender
- Smoking
- Hx_Smoking
- Hx_Radiotherapy
- Thyroid_Function
- Physical_Examination
- Adenopathy
- Pathology
- Focality
- Risk
- T
- N
- M
- Stage
- Response
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
import itertools
import os
import pickle
%matplotlib inline
from operator import add,mul,truediv
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from scipy import stats
from scipy.stats import pointbiserialr, chi2_contingency
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold, KFold, cross_val_score
from sklearn.inspection import permutation_importance
##################################
# Defining file paths
##################################
DATASETS_ORIGINAL_PATH = r"datasets\original"
DATASETS_FINAL_PATH = r"datasets\final\complete"
DATASETS_FINAL_TRAIN_PATH = r"datasets\final\train"
DATASETS_FINAL_TRAIN_FEATURES_PATH = r"datasets\final\train\features"
DATASETS_FINAL_TRAIN_TARGET_PATH = r"datasets\final\train\target"
DATASETS_FINAL_VALIDATION_PATH = r"datasets\final\validation"
DATASETS_FINAL_VALIDATION_FEATURES_PATH = r"datasets\final\validation\features"
DATASETS_FINAL_VALIDATION_TARGET_PATH = r"datasets\final\validation\target"
DATASETS_FINAL_TEST_PATH = r"datasets\final\test"
DATASETS_FINAL_TEST_FEATURES_PATH = r"datasets\final\test\features"
DATASETS_FINAL_TEST_TARGET_PATH = r"datasets\final\test\target"
DATASETS_PREPROCESSED_PATH = r"datasets\preprocessed"
DATASETS_PREPROCESSED_TRAIN_PATH = r"datasets\preprocessed\train"
DATASETS_PREPROCESSED_TRAIN_FEATURES_PATH = r"datasets\preprocessed\train\features"
DATASETS_PREPROCESSED_TRAIN_TARGET_PATH = r"datasets\preprocessed\train\target"
DATASETS_PREPROCESSED_VALIDATION_PATH = r"datasets\preprocessed\validation"
DATASETS_PREPROCESSED_VALIDATION_FEATURES_PATH = r"datasets\preprocessed\validation\features"
DATASETS_PREPROCESSED_VALIDATION_TARGET_PATH = r"datasets\preprocessed\validation\target"
DATASETS_PREPROCESSED_TEST_PATH = r"datasets\preprocessed\test"
DATASETS_PREPROCESSED_TEST_FEATURES_PATH = r"datasets\preprocessed\test\features"
DATASETS_PREPROCESSED_TEST_TARGET_PATH = r"datasets\preprocessed\test\target"
MODELS_PATH = r"models"
##################################
# Loading the dataset
# from the DATASETS_ORIGINAL_PATH
##################################
thyroid_cancer = pd.read_csv(os.path.join("..", DATASETS_ORIGINAL_PATH, "Thyroid_Diff.csv"))
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(thyroid_cancer.shape)
Dataset Dimensions:
(383, 17)
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(thyroid_cancer.dtypes)
Column Names and Data Types:
Age                      int64
Gender                  object
Smoking                 object
Hx Smoking              object
Hx Radiotherapy         object
Thyroid Function        object
Physical Examination    object
Adenopathy              object
Pathology               object
Focality                object
Risk                    object
T                       object
N                       object
M                       object
Stage                   object
Response                object
Recurred                object
dtype: object
##################################
# Renaming and standardizing the column names
# to replace blanks with underscores
##################################
thyroid_cancer.columns = thyroid_cancer.columns.str.replace(" ", "_")
##################################
# Taking a snapshot of the dataset
##################################
thyroid_cancer.head()
Age | Gender | Smoking | Hx_Smoking | Hx_Radiotherapy | Thyroid_Function | Physical_Examination | Adenopathy | Pathology | Focality | Risk | T | N | M | Stage | Response | Recurred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27 | F | No | No | No | Euthyroid | Single nodular goiter-left | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Indeterminate | No |
1 | 34 | F | No | Yes | No | Euthyroid | Multinodular goiter | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
2 | 30 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
3 | 62 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
4 | 62 | F | No | No | No | Euthyroid | Multinodular goiter | No | Micropapillary | Multi-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
##################################
# Selecting categorical columns (both object and categorical types)
# and listing the unique categorical levels
##################################
cat_cols = thyroid_cancer.select_dtypes(include=["object", "category"]).columns
for col in cat_cols:
print(f"Categorical | Object Column: {col}")
print(thyroid_cancer[col].unique())
print("-" * 40)
Categorical | Object Column: Gender
['F' 'M']
----------------------------------------
Categorical | Object Column: Smoking
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Hx_Smoking
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Hx_Radiotherapy
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Thyroid_Function
['Euthyroid' 'Clinical Hyperthyroidism' 'Clinical Hypothyroidism' 'Subclinical Hyperthyroidism' 'Subclinical Hypothyroidism']
----------------------------------------
Categorical | Object Column: Physical_Examination
['Single nodular goiter-left' 'Multinodular goiter' 'Single nodular goiter-right' 'Normal' 'Diffuse goiter']
----------------------------------------
Categorical | Object Column: Adenopathy
['No' 'Right' 'Extensive' 'Left' 'Bilateral' 'Posterior']
----------------------------------------
Categorical | Object Column: Pathology
['Micropapillary' 'Papillary' 'Follicular' 'Hurthel cell']
----------------------------------------
Categorical | Object Column: Focality
['Uni-Focal' 'Multi-Focal']
----------------------------------------
Categorical | Object Column: Risk
['Low' 'Intermediate' 'High']
----------------------------------------
Categorical | Object Column: T
['T1a' 'T1b' 'T2' 'T3a' 'T3b' 'T4a' 'T4b']
----------------------------------------
Categorical | Object Column: N
['N0' 'N1b' 'N1a']
----------------------------------------
Categorical | Object Column: M
['M0' 'M1']
----------------------------------------
Categorical | Object Column: Stage
['I' 'II' 'IVB' 'III' 'IVA']
----------------------------------------
Categorical | Object Column: Response
['Indeterminate' 'Excellent' 'Structural Incomplete' 'Biochemical Incomplete']
----------------------------------------
Categorical | Object Column: Recurred
['No' 'Yes']
----------------------------------------
##################################
# Correcting a category level
##################################
thyroid_cancer["Pathology"] = thyroid_cancer["Pathology"].replace("Hurthel cell", "Hurthle Cell")
##################################
# Setting the levels of the categorical variables
##################################
thyroid_cancer['Recurred'] = thyroid_cancer['Recurred'].astype('category')
thyroid_cancer['Recurred'] = thyroid_cancer['Recurred'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Gender'] = thyroid_cancer['Gender'].astype('category')
thyroid_cancer['Gender'] = thyroid_cancer['Gender'].cat.set_categories(['M', 'F'], ordered=True)
thyroid_cancer['Smoking'] = thyroid_cancer['Smoking'].astype('category')
thyroid_cancer['Smoking'] = thyroid_cancer['Smoking'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Hx_Smoking'] = thyroid_cancer['Hx_Smoking'].astype('category')
thyroid_cancer['Hx_Smoking'] = thyroid_cancer['Hx_Smoking'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Hx_Radiotherapy'] = thyroid_cancer['Hx_Radiotherapy'].astype('category')
thyroid_cancer['Hx_Radiotherapy'] = thyroid_cancer['Hx_Radiotherapy'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Thyroid_Function'] = thyroid_cancer['Thyroid_Function'].astype('category')
thyroid_cancer['Thyroid_Function'] = thyroid_cancer['Thyroid_Function'].cat.set_categories(['Euthyroid', 'Subclinical Hypothyroidism', 'Subclinical Hyperthyroidism', 'Clinical Hypothyroidism', 'Clinical Hyperthyroidism'], ordered=True)
thyroid_cancer['Physical_Examination'] = thyroid_cancer['Physical_Examination'].astype('category')
thyroid_cancer['Physical_Examination'] = thyroid_cancer['Physical_Examination'].cat.set_categories(['Normal', 'Single nodular goiter-left', 'Single nodular goiter-right', 'Multinodular goiter', 'Diffuse goiter'], ordered=True)
thyroid_cancer['Adenopathy'] = thyroid_cancer['Adenopathy'].astype('category')
thyroid_cancer['Adenopathy'] = thyroid_cancer['Adenopathy'].cat.set_categories(['No', 'Left', 'Right', 'Bilateral', 'Posterior', 'Extensive'], ordered=True)
thyroid_cancer['Pathology'] = thyroid_cancer['Pathology'].astype('category')
thyroid_cancer['Pathology'] = thyroid_cancer['Pathology'].cat.set_categories(['Hurthle Cell', 'Follicular', 'Micropapillary', 'Papillary'], ordered=True)
thyroid_cancer['Focality'] = thyroid_cancer['Focality'].astype('category')
thyroid_cancer['Focality'] = thyroid_cancer['Focality'].cat.set_categories(['Uni-Focal', 'Multi-Focal'], ordered=True)
thyroid_cancer['Risk'] = thyroid_cancer['Risk'].astype('category')
thyroid_cancer['Risk'] = thyroid_cancer['Risk'].cat.set_categories(['Low', 'Intermediate', 'High'], ordered=True)
thyroid_cancer['T'] = thyroid_cancer['T'].astype('category')
thyroid_cancer['T'] = thyroid_cancer['T'].cat.set_categories(['T1a', 'T1b', 'T2', 'T3a', 'T3b', 'T4a', 'T4b'], ordered=True)
thyroid_cancer['N'] = thyroid_cancer['N'].astype('category')
thyroid_cancer['N'] = thyroid_cancer['N'].cat.set_categories(['N0', 'N1a', 'N1b'], ordered=True)
thyroid_cancer['M'] = thyroid_cancer['M'].astype('category')
thyroid_cancer['M'] = thyroid_cancer['M'].cat.set_categories(['M0', 'M1'], ordered=True)
thyroid_cancer['Stage'] = thyroid_cancer['Stage'].astype('category')
thyroid_cancer['Stage'] = thyroid_cancer['Stage'].cat.set_categories(['I', 'II', 'III', 'IVA', 'IVB'], ordered=True)
thyroid_cancer['Response'] = thyroid_cancer['Response'].astype('category')
thyroid_cancer['Response'] = thyroid_cancer['Response'].cat.set_categories(['Excellent', 'Structural Incomplete', 'Biochemical Incomplete', 'Indeterminate'], ordered=True)
##################################
# Performing a general exploration of the numeric variables
##################################
print('Numeric Variable Summary:')
display(thyroid_cancer.describe(include='number').transpose())
Numeric Variable Summary:
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Age | 383.0 | 40.866841 | 15.134494 | 15.0 | 29.0 | 37.0 | 51.0 | 82.0 |
##################################
# Performing a general exploration of the categorical variables
##################################
print('Categorical Variable Summary:')
display(thyroid_cancer.describe(include='category').transpose())
Categorical Variable Summary:
count | unique | top | freq | |
---|---|---|---|---|
Gender | 383 | 2 | F | 312 |
Smoking | 383 | 2 | No | 334 |
Hx_Smoking | 383 | 2 | No | 355 |
Hx_Radiotherapy | 383 | 2 | No | 376 |
Thyroid_Function | 383 | 5 | Euthyroid | 332 |
Physical_Examination | 383 | 5 | Single nodular goiter-right | 140 |
Adenopathy | 383 | 6 | No | 277 |
Pathology | 383 | 4 | Papillary | 287 |
Focality | 383 | 2 | Uni-Focal | 247 |
Risk | 383 | 3 | Low | 249 |
T | 383 | 7 | T2 | 151 |
N | 383 | 3 | N0 | 268 |
M | 383 | 2 | M0 | 365 |
Stage | 383 | 5 | I | 333 |
Response | 383 | 4 | Excellent | 208 |
Recurred | 383 | 2 | No | 275 |
##################################
# Performing a general exploration of the categorical variable levels
# based on the ordered categories
##################################
ordered_cat_cols = thyroid_cancer.select_dtypes(include=["category"]).columns
for col in ordered_cat_cols:
print(f"Column: {col}")
print("Absolute Frequencies:")
print(thyroid_cancer[col].value_counts().reindex(thyroid_cancer[col].cat.categories))
print("\nNormalized Frequencies:")
print(thyroid_cancer[col].value_counts(normalize=True).reindex(thyroid_cancer[col].cat.categories))
print("-" * 50)
Column: Gender
Absolute Frequencies: M 71 | F 312
Normalized Frequencies: M 0.185379 | F 0.814621
--------------------------------------------------
Column: Smoking
Absolute Frequencies: No 334 | Yes 49
Normalized Frequencies: No 0.872063 | Yes 0.127937
--------------------------------------------------
Column: Hx_Smoking
Absolute Frequencies: No 355 | Yes 28
Normalized Frequencies: No 0.926893 | Yes 0.073107
--------------------------------------------------
Column: Hx_Radiotherapy
Absolute Frequencies: No 376 | Yes 7
Normalized Frequencies: No 0.981723 | Yes 0.018277
--------------------------------------------------
Column: Thyroid_Function
Absolute Frequencies: Euthyroid 332 | Subclinical Hypothyroidism 14 | Subclinical Hyperthyroidism 5 | Clinical Hypothyroidism 12 | Clinical Hyperthyroidism 20
Normalized Frequencies: Euthyroid 0.866841 | Subclinical Hypothyroidism 0.036554 | Subclinical Hyperthyroidism 0.013055 | Clinical Hypothyroidism 0.031332 | Clinical Hyperthyroidism 0.052219
--------------------------------------------------
Column: Physical_Examination
Absolute Frequencies: Normal 7 | Single nodular goiter-left 89 | Single nodular goiter-right 140 | Multinodular goiter 140 | Diffuse goiter 7
Normalized Frequencies: Normal 0.018277 | Single nodular goiter-left 0.232376 | Single nodular goiter-right 0.365535 | Multinodular goiter 0.365535 | Diffuse goiter 0.018277
--------------------------------------------------
Column: Adenopathy
Absolute Frequencies: No 277 | Left 17 | Right 48 | Bilateral 32 | Posterior 2 | Extensive 7
Normalized Frequencies: No 0.723238 | Left 0.044386 | Right 0.125326 | Bilateral 0.083551 | Posterior 0.005222 | Extensive 0.018277
--------------------------------------------------
Column: Pathology
Absolute Frequencies: Hurthle Cell 20 | Follicular 28 | Micropapillary 48 | Papillary 287
Normalized Frequencies: Hurthle Cell 0.052219 | Follicular 0.073107 | Micropapillary 0.125326 | Papillary 0.749347
--------------------------------------------------
Column: Focality
Absolute Frequencies: Uni-Focal 247 | Multi-Focal 136
Normalized Frequencies: Uni-Focal 0.644909 | Multi-Focal 0.355091
--------------------------------------------------
Column: Risk
Absolute Frequencies: Low 249 | Intermediate 102 | High 32
Normalized Frequencies: Low 0.650131 | Intermediate 0.266319 | High 0.083551
--------------------------------------------------
Column: T
Absolute Frequencies: T1a 49 | T1b 43 | T2 151 | T3a 96 | T3b 16 | T4a 20 | T4b 8
Normalized Frequencies: T1a 0.127937 | T1b 0.112272 | T2 0.394256 | T3a 0.250653 | T3b 0.041775 | T4a 0.052219 | T4b 0.020888
--------------------------------------------------
Column: N
Absolute Frequencies: N0 268 | N1a 22 | N1b 93
Normalized Frequencies: N0 0.699739 | N1a 0.057441 | N1b 0.242820
--------------------------------------------------
Column: M
Absolute Frequencies: M0 365 | M1 18
Normalized Frequencies: M0 0.953003 | M1 0.046997
--------------------------------------------------
Column: Stage
Absolute Frequencies: I 333 | II 32 | III 4 | IVA 3 | IVB 11
Normalized Frequencies: I 0.869452 | II 0.083551 | III 0.010444 | IVA 0.007833 | IVB 0.028721
--------------------------------------------------
Column: Response
Absolute Frequencies: Excellent 208 | Structural Incomplete 91 | Biochemical Incomplete 23 | Indeterminate 61
Normalized Frequencies: Excellent 0.543081 | Structural Incomplete 0.237598 | Biochemical Incomplete 0.060052 | Indeterminate 0.159269
--------------------------------------------------
Column: Recurred
Absolute Frequencies: No 275 | Yes 108
Normalized Frequencies: No 0.718016 | Yes 0.281984
--------------------------------------------------
1.3. Data Quality Assessment ¶
Data quality findings based on assessment are as follows:
- A total of 19 duplicated rows were identified.
- In total, 35 observations were affected, consisting of 16 unique occurrences and 19 subsequent duplicates.
- These 19 duplicates spanned 16 distinct variations, meaning some variations had multiple duplicates.
- To clean the dataset, all 19 duplicate rows were removed, retaining only the first occurrence of each of the 16 unique variations.
- No missing data noted for any variable, with no occurrence of Null.Count>0 or Fill.Rate<1.0.
- Low variance observed for 8 variables with First.Second.Mode.Ratio>5.
- Hx_Radiotherapy: First.Second.Mode.Ratio = 51.000 (comprised 2 category levels)
- M: First.Second.Mode.Ratio = 19.222 (comprised 2 category levels)
- Thyroid_Function: First.Second.Mode.Ratio = 15.650 (comprised 5 category levels)
- Hx_Smoking: First.Second.Mode.Ratio = 12.000 (comprised 2 category levels)
- Stage: First.Second.Mode.Ratio = 9.812 (comprised 5 category levels)
- Smoking: First.Second.Mode.Ratio = 6.428 (comprised 2 category levels)
- Pathology: First.Second.Mode.Ratio = 6.022 (comprised 4 category levels)
- Adenopathy: First.Second.Mode.Ratio = 5.375 (comprised 6 category levels)
- No low variance observed for any variable with Unique.Count.Ratio>10.
- No high skewness observed for any variable with Skewness>3 or Skewness<(-3).
##################################
# Counting the number of duplicated rows
##################################
thyroid_cancer.duplicated().sum()
19
##################################
# Exploring the duplicated rows
##################################
duplicated_rows = thyroid_cancer[thyroid_cancer.duplicated(keep=False)]
display(duplicated_rows)
Age | Gender | Smoking | Hx_Smoking | Hx_Radiotherapy | Thyroid_Function | Physical_Examination | Adenopathy | Pathology | Focality | Risk | T | N | M | Stage | Response | Recurred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
9 | 40 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
22 | 36 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
32 | 36 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
38 | 40 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
40 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
61 | 35 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
66 | 35 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
67 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-left | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
69 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-left | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
73 | 29 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
77 | 29 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
106 | 26 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
110 | 31 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
113 | 32 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
115 | 37 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
119 | 28 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
120 | 37 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
121 | 26 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
123 | 28 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
132 | 32 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
136 | 21 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
137 | 32 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
138 | 26 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
142 | 42 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
161 | 22 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
166 | 31 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
168 | 21 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
170 | 38 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
175 | 34 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
178 | 38 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
183 | 26 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
187 | 34 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
189 | 42 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
196 | 22 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
##################################
# Checking if duplicated rows have identical values across all columns
##################################
num_unique_dup_rows = duplicated_rows.drop_duplicates().shape[0]
num_total_dup_rows = duplicated_rows.shape[0]
if num_unique_dup_rows == 1:
print("All duplicated rows have the same values across all columns.")
else:
print(f"There are {num_unique_dup_rows} unique versions among the {num_total_dup_rows} duplicated rows.")
There are 16 unique versions among the 35 duplicated rows.
##################################
# Counting the unique variations among duplicated rows
##################################
unique_dup_variations = duplicated_rows.drop_duplicates()
variation_counts = duplicated_rows.value_counts().reset_index(name="Count")
print("Unique duplicated row variations and their counts:")
display(variation_counts)
Unique duplicated row variations and their counts:
Age | Gender | Smoking | Hx_Smoking | Hx_Radiotherapy | Thyroid_Function | Physical_Examination | Adenopathy | Pathology | Focality | Risk | T | N | M | Stage | Response | Recurred | Count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 4 |
1 | 32 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 3 |
2 | 21 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
3 | 22 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
4 | 28 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
5 | 29 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No | 2 |
6 | 31 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
7 | 34 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
8 | 35 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No | 2 |
9 | 36 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No | 2 |
10 | 37 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
11 | 38 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
12 | 40 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No | 2 |
13 | 42 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
14 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-left | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No | 2 |
15 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No | 2 |
##################################
# Removing the duplicated rows and
# retaining only the first occurrence
##################################
thyroid_cancer_row_filtered = thyroid_cancer.drop_duplicates(keep="first")
print('Dataset Dimensions: ')
display(thyroid_cancer_row_filtered.shape)
Dataset Dimensions:
(364, 17)
##################################
# Gathering the data types for each column
##################################
data_type_list = list(thyroid_cancer_row_filtered.dtypes)
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(thyroid_cancer_row_filtered.columns)
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(thyroid_cancer_row_filtered)] * len(thyroid_cancer_row_filtered.columns))
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(thyroid_cancer_row_filtered.isna().sum(axis=0))
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(thyroid_cancer_row_filtered.count())
##################################
# Gathering the fill rate (non-missing data proportion) for each column
##################################
fill_rate_list = map(truediv, non_null_count_list, row_count_list)
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
data_type_list,
row_count_list,
non_null_count_list,
null_count_list,
fill_rate_list),
columns=['Column.Name',
'Column.Type',
'Row.Count',
'Non.Null.Count',
'Null.Count',
'Fill.Rate'])
display(all_column_quality_summary)
Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate | |
---|---|---|---|---|---|---|
0 | Age | int64 | 364 | 364 | 0 | 1.0 |
1 | Gender | category | 364 | 364 | 0 | 1.0 |
2 | Smoking | category | 364 | 364 | 0 | 1.0 |
3 | Hx_Smoking | category | 364 | 364 | 0 | 1.0 |
4 | Hx_Radiotherapy | category | 364 | 364 | 0 | 1.0 |
5 | Thyroid_Function | category | 364 | 364 | 0 | 1.0 |
6 | Physical_Examination | category | 364 | 364 | 0 | 1.0 |
7 | Adenopathy | category | 364 | 364 | 0 | 1.0 |
8 | Pathology | category | 364 | 364 | 0 | 1.0 |
9 | Focality | category | 364 | 364 | 0 | 1.0 |
10 | Risk | category | 364 | 364 | 0 | 1.0 |
11 | T | category | 364 | 364 | 0 | 1.0 |
12 | N | category | 364 | 364 | 0 | 1.0 |
13 | M | category | 364 | 364 | 0 | 1.0 |
14 | Stage | category | 364 | 364 | 0 | 1.0 |
15 | Response | category | 364 | 364 | 0 | 1.0 |
16 | Recurred | category | 364 | 364 | 0 | 1.0 |
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])
0
##################################
# Identifying the columns
# with Fill.Rate < 0.90
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<0.90)]
##################################
# Gathering the indices for each observation
##################################
row_index_list = thyroid_cancer_row_filtered.index
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(thyroid_cancer_row_filtered.columns)] * len(thyroid_cancer_row_filtered))
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(thyroid_cancer_row_filtered.isna().sum(axis=1))
##################################
# Gathering the missing data percentage for each row
##################################
missing_rate_list = map(truediv, null_row_list, column_count_list)
##################################
# Identifying the rows
# with missing data
##################################
all_row_quality_summary = pd.DataFrame(zip(row_index_list,
column_count_list,
null_row_list,
missing_rate_list),
columns=['Row.Name',
'Column.Count',
'Null.Count',
'Missing.Rate'])
display(all_row_quality_summary)
Row.Name | Column.Count | Null.Count | Missing.Rate | |
---|---|---|---|---|
0 | 0 | 17 | 0 | 0.0 |
1 | 1 | 17 | 0 | 0.0 |
2 | 2 | 17 | 0 | 0.0 |
3 | 3 | 17 | 0 | 0.0 |
4 | 4 | 17 | 0 | 0.0 |
... | ... | ... | ... | ... |
359 | 378 | 17 | 0 | 0.0 |
360 | 379 | 17 | 0 | 0.0 |
361 | 380 | 17 | 0 | 0.0 |
362 | 381 | 17 | 0 | 0.0 |
363 | 382 | 17 | 0 | 0.0 |
364 rows × 4 columns
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])
0
##################################
# Formulating the dataset
# with numeric columns only
##################################
thyroid_cancer_numeric = thyroid_cancer_row_filtered.select_dtypes(include='number')
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = thyroid_cancer_numeric.columns
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = thyroid_cancer_numeric.min()
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = thyroid_cancer_numeric.mean()
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = thyroid_cancer_numeric.median()
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = thyroid_cancer_numeric.max()
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0] for x in thyroid_cancer_numeric]
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1] for x in thyroid_cancer_numeric]
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [thyroid_cancer_numeric[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in thyroid_cancer_numeric]
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [thyroid_cancer_numeric[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in thyroid_cancer_numeric]
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = thyroid_cancer_numeric.nunique(dropna=True)
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(thyroid_cancer_numeric)] * len(thyroid_cancer_numeric.columns))
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = thyroid_cancer_numeric.skew()
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = thyroid_cancer_numeric.kurtosis()
##################################
# Generating a column quality summary for the numeric column
##################################
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
numeric_minimum_list,
numeric_mean_list,
numeric_median_list,
numeric_maximum_list,
numeric_first_mode_list,
numeric_second_mode_list,
numeric_first_mode_count_list,
numeric_second_mode_count_list,
numeric_first_second_mode_ratio_list,
numeric_unique_count_list,
numeric_row_count_list,
numeric_unique_count_ratio_list,
numeric_skewness_list,
numeric_kurtosis_list),
columns=['Numeric.Column.Name',
'Minimum',
'Mean',
'Median',
'Maximum',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio',
'Skewness',
'Kurtosis'])
display(numeric_column_quality_summary)
Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Age | 15 | 41.25 | 38.0 | 82 | 31 | 27 | 21 | 13 | 1.615385 | 65 | 364 | 0.178571 | 0.678269 | -0.359255 |
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])
0
##################################
# Formulating the dataset
# with categorical columns only
##################################
thyroid_cancer_categorical = thyroid_cancer_row_filtered.select_dtypes(include='category')
##################################
# Gathering the variable names for the categorical column
##################################
categorical_variable_name_list = thyroid_cancer_categorical.columns
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [thyroid_cancer_row_filtered[x].value_counts().index.tolist()[0] for x in thyroid_cancer_categorical]
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [thyroid_cancer_row_filtered[x].value_counts().index.tolist()[1] for x in thyroid_cancer_categorical]
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [thyroid_cancer_categorical[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in thyroid_cancer_categorical]
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [thyroid_cancer_categorical[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in thyroid_cancer_categorical]
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = thyroid_cancer_categorical.nunique(dropna=True)
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(thyroid_cancer_categorical)] * len(thyroid_cancer_categorical.columns))
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
##################################
# Generating a column quality summary for the categorical columns
##################################
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
categorical_first_mode_list,
categorical_second_mode_list,
categorical_first_mode_count_list,
categorical_second_mode_count_list,
categorical_first_second_mode_ratio_list,
categorical_unique_count_list,
categorical_row_count_list,
categorical_unique_count_ratio_list),
columns=['Categorical.Column.Name',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio'])
display(categorical_column_quality_summary)
Categorical.Column.Name | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | |
---|---|---|---|---|---|---|---|---|---|
0 | Gender | F | M | 293 | 71 | 4.126761 | 2 | 364 | 0.005495 |
1 | Smoking | No | Yes | 315 | 49 | 6.428571 | 2 | 364 | 0.005495 |
2 | Hx_Smoking | No | Yes | 336 | 28 | 12.000000 | 2 | 364 | 0.005495 |
3 | Hx_Radiotherapy | No | Yes | 357 | 7 | 51.000000 | 2 | 364 | 0.005495 |
4 | Thyroid_Function | Euthyroid | Clinical Hyperthyroidism | 313 | 20 | 15.650000 | 5 | 364 | 0.013736 |
5 | Physical_Examination | Multinodular goiter | Single nodular goiter-right | 135 | 127 | 1.062992 | 5 | 364 | 0.013736 |
6 | Adenopathy | No | Right | 258 | 48 | 5.375000 | 6 | 364 | 0.016484 |
7 | Pathology | Papillary | Micropapillary | 271 | 45 | 6.022222 | 4 | 364 | 0.010989 |
8 | Focality | Uni-Focal | Multi-Focal | 228 | 136 | 1.676471 | 2 | 364 | 0.005495 |
9 | Risk | Low | Intermediate | 230 | 102 | 2.254902 | 3 | 364 | 0.008242 |
10 | T | T2 | T3a | 138 | 96 | 1.437500 | 7 | 364 | 0.019231 |
11 | N | N0 | N1b | 249 | 93 | 2.677419 | 3 | 364 | 0.008242 |
12 | M | M0 | M1 | 346 | 18 | 19.222222 | 2 | 364 | 0.005495 |
13 | Stage | I | II | 314 | 32 | 9.812500 | 5 | 364 | 0.013736 |
14 | Response | Excellent | Structural Incomplete | 189 | 91 | 2.076923 | 4 | 364 | 0.010989 |
15 | Recurred | No | Yes | 256 | 108 | 2.370370 | 2 | 364 | 0.005495 |
##################################
# Counting the number of categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
8
##################################
# Identifying the categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
display(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)].sort_values(by=['First.Second.Mode.Ratio'], ascending=False))
Categorical.Column.Name | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | |
---|---|---|---|---|---|---|---|---|---|
3 | Hx_Radiotherapy | No | Yes | 357 | 7 | 51.000000 | 2 | 364 | 0.005495 |
12 | M | M0 | M1 | 346 | 18 | 19.222222 | 2 | 364 | 0.005495 |
4 | Thyroid_Function | Euthyroid | Clinical Hyperthyroidism | 313 | 20 | 15.650000 | 5 | 364 | 0.013736 |
2 | Hx_Smoking | No | Yes | 336 | 28 | 12.000000 | 2 | 364 | 0.005495 |
13 | Stage | I | II | 314 | 32 | 9.812500 | 5 | 364 | 0.013736 |
1 | Smoking | No | Yes | 315 | 49 | 6.428571 | 2 | 364 | 0.005495 |
7 | Pathology | Papillary | Micropapillary | 271 | 45 | 6.022222 | 4 | 364 | 0.010989 |
6 | Adenopathy | No | Right | 258 | 48 | 5.375000 | 6 | 364 | 0.016484 |
##################################
# Counting the number of categorical columns
# with Unique.Count.Ratio > 10.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
0
1.4. Data Preprocessing ¶
1.4.1. Data Splitting ¶
- The baseline dataset (with duplicate rows removed from the original dataset) is comprised of:
- 364 rows (observations)
- 256 Recurred=No: 70.33%
- 108 Recurred=Yes: 29.67%
- 17 columns (variables)
- 1/17 target (categorical)
- Recurred
- 1/17 predictor (numeric)
- Age
- 15/17 predictor (categorical)
- Gender
- Smoking
- Hx_Smoking
- Hx_Radiotherapy
- Thyroid_Function
- Physical_Examination
- Adenopathy
- Pathology
- Focality
- Risk
- T
- N
- M
- Stage
- Response
- The baseline dataset was divided into three subsets using a fixed random seed (illustrated in the sketch after this list):
- test data: 25% of the baseline data with class stratification applied
- train data (initial): 75% of the baseline data with class stratification applied
- train data (final): 75% of the train (initial) data with class stratification applied
- validation data: 25% of the train (initial) data with class stratification applied
- Models were developed from the train data (final). Using the same dataset, a subset of models with optimal hyperparameters was selected based on cross-validation.
- Among the candidate models with optimal hyperparameters, the final model was selected based on performance on the validation data.
- Performance of the selected final model (and of the other candidate models, for post-selection comparison) was evaluated using the test data (a minimal sketch of this workflow is provided after this list).
- The train data (final) subset is comprised of:
- 204 rows (observations)
- 143 Recurred=No: 70.10%
- 61 Recurred=Yes: 29.90%
- 17 columns (variables)
- The validation data subset is comprised of:
- 69 rows (observations)
- 49 Recurred=No: 71.01%
- 20 Recurred=Yes: 28.99%
- 17 columns (variables)
- The test data subset is comprised of:
- 91 rows (observations)
- 64 Recurred=No: 70.33%
- 27 Recurred=Yes: 29.67%
- 17 columns (variables)
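- The model development and selection workflow outlined above can be illustrated with a minimal sketch. The snippet below is only an illustration: it uses synthetic stand-in data, a placeholder RandomForestClassifier and a hypothetical parameter grid (all assumptions, not the ensemble models tuned later in this document) purely to show the sequence of cross-validated tuning on the final train split, candidate comparison on the validation split, and a single evaluation on the test split.
##################################
# Minimal sketch of the selection workflow
# using synthetic stand-in data and a placeholder model
# (not the actual ensemble models developed later)
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
# Synthetic stand-in data mimicking the 70/30 class balance
X_demo, y_demo = make_classification(n_samples=364, n_features=10, weights=[0.7, 0.3], random_state=987654321)
# Two-stage stratified split: test, then validation carved out of the initial train split
X_tr_init, X_te, y_tr_init, y_te = train_test_split(X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=987654321)
X_tr, X_val, y_tr, y_val = train_test_split(X_tr_init, y_tr_init, test_size=0.25, stratify=y_tr_init, random_state=987654321)
# Hyperparameter selection via cross-validation on the final train data
demo_search = GridSearchCV(RandomForestClassifier(random_state=987654321),
                           param_grid={'n_estimators': [100, 200], 'max_depth': [3, 5]},
                           scoring='f1', cv=5)
demo_search.fit(X_tr, y_tr)
# Candidate comparison on the validation data
validation_f1 = f1_score(y_val, demo_search.best_estimator_.predict(X_val))
# Single, final evaluation on the test data
test_f1 = f1_score(y_te, demo_search.best_estimator_.predict(X_te))
print(f"Validation F1: {validation_f1:.3f} | Test F1: {test_f1:.3f}")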
##################################
# Creating a dataset copy
# of the row filtered data
##################################
thyroid_cancer_baseline = thyroid_cancer_row_filtered.copy()
##################################
# Performing a general exploration
# of the baseline dataset
##################################
print('Final Dataset Dimensions: ')
display(thyroid_cancer_baseline.shape)
Final Dataset Dimensions:
(364, 17)
print('Target Variable Breakdown: ')
thyroid_cancer_breakdown = thyroid_cancer_baseline.groupby('Recurred', observed=True).size().reset_index(name='Count')
thyroid_cancer_breakdown['Percentage'] = (thyroid_cancer_breakdown['Count'] / len(thyroid_cancer_baseline)) * 100
display(thyroid_cancer_breakdown)
Target Variable Breakdown:
Recurred | Count | Percentage | |
---|---|---|---|
0 | No | 256 | 70.32967 |
1 | Yes | 108 | 29.67033 |
##################################
# Formulating the train and test data
# from the final dataset
# by applying stratification and
# using a 75-25 ratio
##################################
thyroid_cancer_train_initial, thyroid_cancer_test = train_test_split(thyroid_cancer_baseline,
test_size=0.25,
stratify=thyroid_cancer_baseline['Recurred'],
random_state=987654321)
##################################
# Performing a general exploration
# of the initial training dataset
##################################
X_train_initial = thyroid_cancer_train_initial.drop('Recurred', axis = 1)
y_train_initial = thyroid_cancer_train_initial['Recurred']
print('Initial Train Dataset Dimensions: ')
display(X_train_initial.shape)
display(y_train_initial.shape)
print('Initial Train Target Variable Breakdown: ')
display(y_train_initial.value_counts())
print('Initial Train Target Variable Proportion: ')
display(y_train_initial.value_counts(normalize = True))
Initial Train Dataset Dimensions:
(273, 16)
(273,)
Initial Train Target Variable Breakdown:
Recurred No 192 Yes 81 Name: count, dtype: int64
Initial Train Target Variable Proportion:
Recurred No 0.703297 Yes 0.296703 Name: proportion, dtype: float64
##################################
# Performing a general exploration
# of the test dataset
##################################
X_test = thyroid_cancer_test.drop('Recurred', axis = 1)
y_test = thyroid_cancer_test['Recurred']
print('Test Dataset Dimensions: ')
display(X_test.shape)
display(y_test.shape)
print('Test Target Variable Breakdown: ')
display(y_test.value_counts())
print('Test Target Variable Proportion: ')
display(y_test.value_counts(normalize = True))
Test Dataset Dimensions:
(91, 16)
(91,)
Test Target Variable Breakdown:
Recurred No 64 Yes 27 Name: count, dtype: int64
Test Target Variable Proportion:
Recurred No 0.703297 Yes 0.296703 Name: proportion, dtype: float64
##################################
# Formulating the train and validation data
# from the train dataset
# by applying stratification and
# using a 75-25 ratio
##################################
thyroid_cancer_train, thyroid_cancer_validation = train_test_split(thyroid_cancer_train_initial,
test_size=0.25,
stratify=thyroid_cancer_train_initial['Recurred'],
random_state=987654321)
##################################
# Performing a general exploration
# of the final training dataset
##################################
X_train = thyroid_cancer_train.drop('Recurred', axis = 1)
y_train = thyroid_cancer_train['Recurred']
print('Final Train Dataset Dimensions: ')
display(X_train.shape)
display(y_train.shape)
print('Final Train Target Variable Breakdown: ')
display(y_train.value_counts())
print('Final Train Target Variable Proportion: ')
display(y_train.value_counts(normalize = True))
Final Train Dataset Dimensions:
(204, 16)
(204,)
Final Train Target Variable Breakdown:
Recurred No 143 Yes 61 Name: count, dtype: int64
Final Train Target Variable Proportion:
Recurred No 0.70098 Yes 0.29902 Name: proportion, dtype: float64
##################################
# Performing a general exploration
# of the validation dataset
##################################
X_validation = thyroid_cancer_validation.drop('Recurred', axis = 1)
y_validation = thyroid_cancer_validation['Recurred']
print('Validation Dataset Dimensions: ')
display(X_validation.shape)
display(y_validation.shape)
print('Validation Target Variable Breakdown: ')
display(y_validation.value_counts())
print('Validation Target Variable Proportion: ')
display(y_validation.value_counts(normalize = True))
Validation Dataset Dimensions:
(69, 16)
(69,)
Validation Target Variable Breakdown:
Recurred No 49 Yes 20 Name: count, dtype: int64
Validation Target Variable Proportion:
Recurred No 0.710145 Yes 0.289855 Name: proportion, dtype: float64
##################################
# Saving the training data
# to the DATASETS_FINAL_TRAIN_PATH
# and DATASETS_FINAL_TRAIN_FEATURES_PATH
# and DATASETS_FINAL_TRAIN_TARGET_PATH
##################################
thyroid_cancer_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_PATH, "thyroid_cancer_train.csv"), index=False)
X_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_FEATURES_PATH, "X_train.csv"), index=False)
y_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_TARGET_PATH, "y_train.csv"), index=False)
##################################
# Saving the validation data
# to the DATASETS_FINAL_VALIDATION_PATH
# and DATASETS_FINAL_VALIDATION_FEATURES_PATH
# and DATASETS_FINAL_VALIDATION_TARGET_PATH
##################################
thyroid_cancer_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_PATH, "thyroid_cancer_validation.csv"), index=False)
X_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_FEATURES_PATH, "X_validation.csv"), index=False)
y_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_TARGET_PATH, "y_validation.csv"), index=False)
##################################
# Saving the test data
# to the DATASETS_FINAL_TEST_PATH
# and DATASETS_FINAL_TEST_FEATURES_PATH
# and DATASETS_FINAL_TEST_TARGET_PATH
##################################
thyroid_cancer_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_PATH, "thyroid_cancer_test.csv"), index=False)
X_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_FEATURES_PATH, "X_test.csv"), index=False)
y_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_TARGET_PATH, "y_test.csv"), index=False)
1.4.2 Data Profiling ¶
- No significant distributional anomalies were observed for the numeric predictor Age.
- 9 categorical predictors were observed to contain categories with too few cases, which risks poor generalization and cross-validation issues:
- Thyroid_Function:
- 171 Thyroid_Function=Euthyroid: 83.82%
- 10 Thyroid_Function=Subclinical Hypothyroidism: 4.90%
- 3 Thyroid_Function=Subclinical Hyperthyroidism: 1.47%
- 7 Thyroid_Function=Clinical Hypothyroidism: 3.43%
- 13 Thyroid_Function=Clinical Hyperthyroidism: 6.37%
- Physical_Examination:
- 4 Physical_Examination=Normal: 1.96%
- 50 Physical_Examination=Single nodular goiter-left: 24.50%
- 68 Physical_Examination=Single nodular goiter-right: 33.33%
- 79 Physical_Examination=Multinodular goiter: 38.72%
- 3 Physical_Examination=Diffuse goiter: 1.47%
- Adenopathy:
- 144 Adenopathy=No: 70.59%
- 14 Adenopathy=Left: 6.86%
- 21 Adenopathy=Right: 10.29%
- 19 Adenopathy=Bilateral: 9.31%
- 2 Adenopathy=Posterior: 0.98%
- 4 Adenopathy=Extensive: 1.96%
- Pathology:
- 15 Pathology=Hurthle Cell: 7.35%
- 14 Pathology=Follicular: 6.86%
- 26 Pathology=Micropapillary: 12.74%
- 149 Pathology=Papillary: 73.03%
- Risk:
- 127 Risk=Low: 62.25%
- 60 Risk=Intermediate: 29.41%
- 17 Risk=High: 8.33%
- T:
- 26 T=T1a: 12.74%
- 21 T=T1b: 10.29%
- 73 T=T2: 35.78%
- 58 T=T3a: 28.43%
- 10 T=T3b: 4.90%
- 12 T=T4a: 5.88%
- 4 T=T4b: 1.96%
- N:
- 139 N=N0: 68.13%
- 11 N=N1a: 5.39%
- 54 N=N1b: 26.47%
- Stage:
- 174 Stage=I: 85.29%
- 21 Stage=II: 10.29%
- 2 Stage=III: 0.98%
- 2 Stage=IVA: 0.98%
- 5 Stage=IVB: 2.45%
- Response:
- 109 Response=Excellent: 53.43%
- 53 Response=Structural Incomplete: 25.98%
- 8 Response=Biochemical Incomplete: 3.92%
- 34 Response=Indeterminate: 16.67%
- 3 categorical predictors were excluded from the dataset after being observed with extremely low variance, with a single category dominating nearly all observations; such predictors offer limited predictive power and may increase model complexity without performance gains (a programmatic check for rare levels and near-zero variance is sketched after this list):
- Hx_Smoking:
- 193 Hx_Smoking=No: 94.61%
- 11 Hx_Smoking=Yes: 5.39%
- Hx_Radiotherapy:
- 202 Hx_Radiotherapy=No: 99.02%
- 2 Hx_Radiotherapy=Yes: 0.98%
- M:
- 194 M=M0: 95.10%
- 10 M=M1: 4.90%
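- As referenced above, the rare category levels and near-zero-variance predictors flagged in this list can also be surfaced programmatically. The snippet below is a minimal sketch applied to the final train data; the 5% rarity and 95% dominance thresholds are assumptions chosen for illustration, not values prescribed by this analysis.
##################################
# Minimal sketch: flagging rare category levels (assumed < 5%)
# and near-zero-variance predictors (assumed dominant level > 95%)
# in the final train data
##################################
rare_level_threshold = 0.05
dominant_level_threshold = 0.95
categorical_check_columns = thyroid_cancer_train.select_dtypes(include=['category', 'object']).columns.drop('Recurred', errors='ignore')
for col in categorical_check_columns:
    # Proportions of the observed levels, sorted in descending order
    level_proportions = thyroid_cancer_train[col].value_counts(normalize=True)
    level_proportions = level_proportions[level_proportions > 0]
    # Levels below the assumed rarity threshold
    rare_levels = level_proportions[level_proportions < rare_level_threshold]
    if not rare_levels.empty:
        print(f"{col}: rare levels -> {rare_levels.round(4).to_dict()}")
    # Near-zero variance when the dominant level exceeds the assumed threshold
    if level_proportions.iloc[0] > dominant_level_threshold:
        print(f"{col}: near-zero variance (dominant level proportion = {level_proportions.iloc[0]:.4f})")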
##################################
# Segregating the target
# and predictor variables
##################################
thyroid_cancer_train_predictors = thyroid_cancer_train.iloc[:,:-1].columns
thyroid_cancer_train_predictors_numeric = thyroid_cancer_train.iloc[:,:-1].loc[:, thyroid_cancer_train.iloc[:,:-1].columns == 'Age'].columns
thyroid_cancer_train_predictors_categorical = thyroid_cancer_train.iloc[:,:-1].loc[:,thyroid_cancer_train.iloc[:,:-1].columns != 'Age'].columns
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = thyroid_cancer_train_predictors_numeric
##################################
# Segregating the target variable
# and numeric predictors
##################################
histogram_grouping_variable = 'Recurred'
histogram_frequency_variable = numeric_variable_name_list.values[0]
##################################
# Comparing the numeric predictors
# grouped by the target variable
##################################
colors = plt.get_cmap('tab10').colors
plt.figure(figsize=(7, 5))
group_no = thyroid_cancer_train[thyroid_cancer_train[histogram_grouping_variable] == 'No'][histogram_frequency_variable]
group_yes = thyroid_cancer_train[thyroid_cancer_train[histogram_grouping_variable] == 'Yes'][histogram_frequency_variable]
plt.hist(group_no, bins=20, alpha=0.5, color=colors[0], label='No', edgecolor='black')
plt.hist(group_yes, bins=20, alpha=0.5, color=colors[1], label='Yes', edgecolor='black')
plt.title(f'{histogram_grouping_variable} Versus {histogram_frequency_variable}')
plt.xlabel(histogram_frequency_variable)
plt.ylabel('Frequency')
plt.legend()
plt.show()
##################################
# Performing a general exploration of the categorical variable levels
# based on the ordered categories
##################################
ordered_cat_cols = thyroid_cancer_train.select_dtypes(include=["category"]).columns
for col in ordered_cat_cols:
print(f"Column: {col}")
print("Absolute Frequencies:")
print(thyroid_cancer_train[col].value_counts().reindex(thyroid_cancer_train[col].cat.categories))
print("\nNormalized Frequencies:")
print(thyroid_cancer_train[col].value_counts(normalize=True).reindex(thyroid_cancer_train[col].cat.categories))
print("-" * 50)
Column: Gender | Absolute Frequencies: M 44, F 160 | Normalized Frequencies: M 0.215686, F 0.784314
Column: Smoking | Absolute Frequencies: No 177, Yes 27 | Normalized Frequencies: No 0.867647, Yes 0.132353
Column: Hx_Smoking | Absolute Frequencies: No 193, Yes 11 | Normalized Frequencies: No 0.946078, Yes 0.053922
Column: Hx_Radiotherapy | Absolute Frequencies: No 202, Yes 2 | Normalized Frequencies: No 0.990196, Yes 0.009804
Column: Thyroid_Function | Absolute Frequencies: Euthyroid 171, Subclinical Hypothyroidism 10, Subclinical Hyperthyroidism 3, Clinical Hypothyroidism 7, Clinical Hyperthyroidism 13 | Normalized Frequencies: Euthyroid 0.838235, Subclinical Hypothyroidism 0.049020, Subclinical Hyperthyroidism 0.014706, Clinical Hypothyroidism 0.034314, Clinical Hyperthyroidism 0.063725
Column: Physical_Examination | Absolute Frequencies: Normal 4, Single nodular goiter-left 50, Single nodular goiter-right 68, Multinodular goiter 79, Diffuse goiter 3 | Normalized Frequencies: Normal 0.019608, Single nodular goiter-left 0.245098, Single nodular goiter-right 0.333333, Multinodular goiter 0.387255, Diffuse goiter 0.014706
Column: Adenopathy | Absolute Frequencies: No 144, Left 14, Right 21, Bilateral 19, Posterior 2, Extensive 4 | Normalized Frequencies: No 0.705882, Left 0.068627, Right 0.102941, Bilateral 0.093137, Posterior 0.009804, Extensive 0.019608
Column: Pathology | Absolute Frequencies: Hurthle Cell 15, Follicular 14, Micropapillary 26, Papillary 149 | Normalized Frequencies: Hurthle Cell 0.073529, Follicular 0.068627, Micropapillary 0.127451, Papillary 0.730392
Column: Focality | Absolute Frequencies: Uni-Focal 129, Multi-Focal 75 | Normalized Frequencies: Uni-Focal 0.632353, Multi-Focal 0.367647
Column: Risk | Absolute Frequencies: Low 127, Intermediate 60, High 17 | Normalized Frequencies: Low 0.622549, Intermediate 0.294118, High 0.083333
Column: T | Absolute Frequencies: T1a 26, T1b 21, T2 73, T3a 58, T3b 10, T4a 12, T4b 4 | Normalized Frequencies: T1a 0.127451, T1b 0.102941, T2 0.357843, T3a 0.284314, T3b 0.049020, T4a 0.058824, T4b 0.019608
Column: N | Absolute Frequencies: N0 139, N1a 11, N1b 54 | Normalized Frequencies: N0 0.681373, N1a 0.053922, N1b 0.264706
Column: M | Absolute Frequencies: M0 194, M1 10 | Normalized Frequencies: M0 0.95098, M1 0.04902
Column: Stage | Absolute Frequencies: I 174, II 21, III 2, IVA 2, IVB 5 | Normalized Frequencies: I 0.852941, II 0.102941, III 0.009804, IVA 0.009804, IVB 0.024510
Column: Response | Absolute Frequencies: Excellent 109, Structural Incomplete 53, Biochemical Incomplete 8, Indeterminate 34 | Normalized Frequencies: Excellent 0.534314, Structural Incomplete 0.259804, Biochemical Incomplete 0.039216, Indeterminate 0.166667
Column: Recurred | Absolute Frequencies: No 143, Yes 61 | Normalized Frequencies: No 0.70098, Yes 0.29902
##################################
# Segregating the target variable
# and categorical predictors
##################################
proportion_y_variables = thyroid_cancer_train_predictors_categorical
proportion_x_variable = 'Recurred'
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 5
num_cols = 3
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 25))
##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()
##################################
# Formulating the individual stacked column plots
# for all categorical columns
##################################
for i, y_variable in enumerate(proportion_y_variables):
ax = axes[i]
category_counts = thyroid_cancer_train.groupby([proportion_x_variable, y_variable], observed=True).size().unstack(fill_value=0)
category_proportions = category_counts.div(category_counts.sum(axis=1), axis=0)
category_proportions.plot(kind='bar', stacked=True, ax=ax)
ax.set_title(f'{proportion_x_variable} Versus {y_variable}')
ax.set_xlabel(proportion_x_variable)
ax.set_ylabel('Proportions')
ax.legend(loc="lower center")
##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()
##################################
# Presenting the subplots
##################################
plt.show()
##################################
# Removing predictors observed with extreme
# near-zero variance and a limited number of levels
##################################
thyroid_cancer_train_column_filtered = thyroid_cancer_train.drop(columns=['Hx_Radiotherapy','M','Hx_Smoking'])
thyroid_cancer_train_column_filtered.head()
Age | Gender | Smoking | Thyroid_Function | Physical_Examination | Adenopathy | Pathology | Focality | Risk | T | N | Stage | Response | Recurred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
140 | 28 | F | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | I | Excellent | No |
205 | 36 | F | No | Euthyroid | Single nodular goiter-right | Right | Papillary | Uni-Focal | Low | T2 | N1b | I | Indeterminate | No |
277 | 41 | M | Yes | Euthyroid | Single nodular goiter-right | No | Hurthle Cell | Multi-Focal | Intermediate | T3a | N0 | I | Excellent | No |
294 | 42 | M | No | Subclinical Hypothyroidism | Single nodular goiter-right | No | Papillary | Multi-Focal | Intermediate | T3a | N1a | I | Indeterminate | No |
268 | 32 | F | No | Euthyroid | Single nodular goiter-left | No | Papillary | Uni-Focal | Low | T3a | N0 | I | Excellent | No |
1.4.3 Category Aggregation and Encoding ¶
- Category aggregation was applied to the previously identified high-cardinality categorical predictors whose levels contained only a few observations, in order to improve model stability during cross-validation and enhance generalization:
- Thyroid_Function:
- 171 Thyroid_Function=Euthyroid: 83.82%
- 33 Thyroid_Function=Hypothyroidism or Hyperthyroidism: 16.18%
- Physical_Examination:
- 122 Physical_Examination=Normal or Single Nodular Goiter: 59.80%
- 82 Physical_Examination=Multinodular or Diffuse Goiter: 40.20%
- Adenopathy:
- 144 Adenopathy=No: 70.59%
- 60 Adenopathy=Yes: 29.41%
- Pathology:
- 29 Pathology=Non-Papillary: 14.22%
- 175 Pathology=Papillary: 85.78%
- Risk:
- 127 Risk=Low: 62.25%
- 77 Risk=Intermediate to High: 37.75%
- T:
- 120 T=T1 to T2: 58.82%
- 84 T=T3 to T4b: 41.18%
- N:
- 139 N=N0: 68.14%
- 65 N=N1: 31.86%
- Stage:
- 174 Stage=I: 85.29%
- 30 Stage=II to IVB: 14.71%
- Response:
- 109 Response=Excellent: 53.43%
- 95 Response=Indeterminate or Incomplete: 46.57%
##################################
# Merging small categories into broader groups
# for certain categorical predictors
# to ensure sufficient representation in statistical models
# and prevent sparsity issues in cross-validation
##################################
thyroid_cancer_train_column_filtered['Thyroid_Function'] = thyroid_cancer_train_column_filtered['Thyroid_Function'].map(lambda x: 'Euthyroid' if (x in ['Euthyroid']) else 'Hypothyroidism or Hyperthyroidism').astype('category')
thyroid_cancer_train_column_filtered['Physical_Examination'] = thyroid_cancer_train_column_filtered['Physical_Examination'].map(lambda x: 'Normal or Single Nodular Goiter' if (x in ['Normal', 'Single nodular goiter-left', 'Single nodular goiter-right']) else 'Multinodular or Diffuse Goiter').astype('category')
thyroid_cancer_train_column_filtered['Adenopathy'] = thyroid_cancer_train_column_filtered['Adenopathy'].map(lambda x: 'No' if x == 'No' else ('Yes' if pd.notna(x) and x != '' else x)).astype('category')
thyroid_cancer_train_column_filtered['Pathology'] = thyroid_cancer_train_column_filtered['Pathology'].map(lambda x: 'Non-Papillary' if (x in ['Hurthle Cell', 'Follicular']) else 'Papillary').astype('category')
thyroid_cancer_train_column_filtered['Risk'] = thyroid_cancer_train_column_filtered['Risk'].map(lambda x: 'Low' if (x in ['Low']) else 'Intermediate to High').astype('category')
thyroid_cancer_train_column_filtered['T'] = thyroid_cancer_train_column_filtered['T'].map(lambda x: 'T1 to T2' if (x in ['T1a', 'T1b', 'T2']) else 'T3 to T4b').astype('category')
thyroid_cancer_train_column_filtered['N'] = thyroid_cancer_train_column_filtered['N'].map(lambda x: 'N0' if (x in ['N0']) else 'N1').astype('category')
thyroid_cancer_train_column_filtered['Stage'] = thyroid_cancer_train_column_filtered['Stage'].map(lambda x: 'I' if (x in ['I']) else 'II to IVB').astype('category')
thyroid_cancer_train_column_filtered['Response'] = thyroid_cancer_train_column_filtered['Response'].map(lambda x: 'Indeterminate or Incomplete' if (x in ['Indeterminate', 'Structural Incomplete', 'Biochemical Incomplete']) else 'Excellent').astype('category')
thyroid_cancer_train_column_filtered.head()
Age | Gender | Smoking | Thyroid_Function | Physical_Examination | Adenopathy | Pathology | Focality | Risk | T | N | Stage | Response | Recurred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
140 | 28 | F | No | Euthyroid | Multinodular or Diffuse Goiter | No | Papillary | Uni-Focal | Low | T1 to T2 | N0 | I | Excellent | No |
205 | 36 | F | No | Euthyroid | Normal or Single Nodular Goiter | Yes | Papillary | Uni-Focal | Low | T1 to T2 | N1 | I | Indeterminate or Incomplete | No |
277 | 41 | M | Yes | Euthyroid | Normal or Single Nodular Goiter | No | Non-Papillary | Multi-Focal | Intermediate to High | T3 to T4b | N0 | I | Excellent | No |
294 | 42 | M | No | Hypothyroidism or Hyperthyroidism | Normal or Single Nodular Goiter | No | Papillary | Multi-Focal | Intermediate to High | T3 to T4b | N1 | I | Indeterminate or Incomplete | No |
268 | 32 | F | No | Euthyroid | Normal or Single Nodular Goiter | No | Papillary | Uni-Focal | Low | T3 to T4b | N0 | I | Excellent | No |
##################################
# Performing a general exploration of the categorical variable levels
# based on the ordered categories
##################################
ordered_cat_cols = thyroid_cancer_train_column_filtered.select_dtypes(include=["category"]).columns
for col in ordered_cat_cols:
print(f"Column: {col}")
print("Absolute Frequencies:")
print(thyroid_cancer_train_column_filtered[col].value_counts().reindex(thyroid_cancer_train_column_filtered[col].cat.categories))
print("\nNormalized Frequencies:")
print(thyroid_cancer_train_column_filtered[col].value_counts(normalize=True).reindex(thyroid_cancer_train_column_filtered[col].cat.categories))
print("-" * 50)
Column: Gender | Absolute Frequencies: M 44, F 160 | Normalized Frequencies: M 0.215686, F 0.784314
Column: Smoking | Absolute Frequencies: No 177, Yes 27 | Normalized Frequencies: No 0.867647, Yes 0.132353
Column: Thyroid_Function | Absolute Frequencies: Euthyroid 171, Hypothyroidism or Hyperthyroidism 33 | Normalized Frequencies: Euthyroid 0.838235, Hypothyroidism or Hyperthyroidism 0.161765
Column: Physical_Examination | Absolute Frequencies: Multinodular or Diffuse Goiter 82, Normal or Single Nodular Goiter 122 | Normalized Frequencies: Multinodular or Diffuse Goiter 0.401961, Normal or Single Nodular Goiter 0.598039
Column: Adenopathy | Absolute Frequencies: No 144, Yes 60 | Normalized Frequencies: No 0.705882, Yes 0.294118
Column: Pathology | Absolute Frequencies: Non-Papillary 29, Papillary 175 | Normalized Frequencies: Non-Papillary 0.142157, Papillary 0.857843
Column: Focality | Absolute Frequencies: Uni-Focal 129, Multi-Focal 75 | Normalized Frequencies: Uni-Focal 0.632353, Multi-Focal 0.367647
Column: Risk | Absolute Frequencies: Intermediate to High 77, Low 127 | Normalized Frequencies: Intermediate to High 0.377451, Low 0.622549
Column: T | Absolute Frequencies: T1 to T2 120, T3 to T4b 84 | Normalized Frequencies: T1 to T2 0.588235, T3 to T4b 0.411765
Column: N | Absolute Frequencies: N0 139, N1 65 | Normalized Frequencies: N0 0.681373, N1 0.318627
Column: Stage | Absolute Frequencies: I 174, II to IVB 30 | Normalized Frequencies: I 0.852941, II to IVB 0.147059
Column: Response | Absolute Frequencies: Excellent 109, Indeterminate or Incomplete 95 | Normalized Frequencies: Excellent 0.534314, Indeterminate or Incomplete 0.465686
Column: Recurred | Absolute Frequencies: No 143, Yes 61 | Normalized Frequencies: No 0.70098, Yes 0.29902
##################################
# Segregating the target
# and predictor variables
##################################
thyroid_cancer_train_predictors = thyroid_cancer_train_column_filtered.iloc[:,:-1].columns
thyroid_cancer_train_predictors_numeric = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:, thyroid_cancer_train_column_filtered.iloc[:,:-1].columns == 'Age'].columns
thyroid_cancer_train_predictors_categorical = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:,thyroid_cancer_train_column_filtered.iloc[:,:-1].columns != 'Age'].columns
##################################
# Segregating the target variable
# and categorical predictors
##################################
proportion_y_variables = thyroid_cancer_train_predictors_categorical
proportion_x_variable = 'Recurred'
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 4
num_cols = 3
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 20))
##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()
##################################
# Formulating the individual stacked column plots
# for all categorical columns
##################################
for i, y_variable in enumerate(proportion_y_variables):
ax = axes[i]
category_counts = thyroid_cancer_train_column_filtered.groupby([proportion_x_variable, y_variable], observed=True).size().unstack(fill_value=0)
category_proportions = category_counts.div(category_counts.sum(axis=1), axis=0)
category_proportions.plot(kind='bar', stacked=True, ax=ax)
ax.set_title(f'{proportion_x_variable} Versus {y_variable}')
ax.set_xlabel(proportion_x_variable)
ax.set_ylabel('Proportions')
ax.legend(loc="lower center")
##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()
##################################
# Presenting the subplots
##################################
plt.show()
1.4.4 Outlier and Distributional Shape Analysis ¶
- No outliers (Outlier.Count>0, Outlier.Ratio>0.000), high skewness (Skewness>3 or Skewness<(-3)) or abnormal kurtosis (Kurtosis>2 or Kurtosis<(-2)) were observed for the numeric predictor.
- Age: Outlier.Count = 0, Outlier.Ratio = 0.000, Skewness = 0.525, Kurtosis = -0.494
##################################
# Formulating the imputed dataset
# with numeric columns only
##################################
thyroid_cancer_train_column_filtered['Age'] = pd.to_numeric(thyroid_cancer_train_column_filtered['Age'])
thyroid_cancer_train_column_filtered_numeric = thyroid_cancer_train_column_filtered.select_dtypes(include='number')
thyroid_cancer_train_column_filtered_numeric = thyroid_cancer_train_column_filtered_numeric.to_frame() if isinstance(thyroid_cancer_train_column_filtered_numeric, pd.Series) else thyroid_cancer_train_column_filtered_numeric
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = list(thyroid_cancer_train_column_filtered_numeric.columns)
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = thyroid_cancer_train_column_filtered_numeric.skew()
##################################
# Computing the interquartile range
# for all columns
##################################
thyroid_cancer_train_column_filtered_numeric_q1 = thyroid_cancer_train_column_filtered_numeric.quantile(0.25)
thyroid_cancer_train_column_filtered_numeric_q3 = thyroid_cancer_train_column_filtered_numeric.quantile(0.75)
thyroid_cancer_train_column_filtered_numeric_iqr = thyroid_cancer_train_column_filtered_numeric_q3 - thyroid_cancer_train_column_filtered_numeric_q1
##################################
# Gathering the outlier count for each numeric column
# based on the interquartile range criterion
##################################
numeric_outlier_count_list = ((thyroid_cancer_train_column_filtered_numeric < (thyroid_cancer_train_column_filtered_numeric_q1 - 1.5 * thyroid_cancer_train_column_filtered_numeric_iqr)) | (thyroid_cancer_train_column_filtered_numeric > (thyroid_cancer_train_column_filtered_numeric_q3 + 1.5 * thyroid_cancer_train_column_filtered_numeric_iqr))).sum()
##################################
# Gathering the number of observations for each column
##################################
numeric_row_count_list = list([len(thyroid_cancer_train_column_filtered_numeric)] * len(thyroid_cancer_train_column_filtered_numeric.columns))
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_outlier_ratio_list = map(truediv, numeric_outlier_count_list, numeric_row_count_list)
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = thyroid_cancer_train_column_filtered_numeric.skew()
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = thyroid_cancer_train_column_filtered_numeric.kurtosis()
##################################
# Formulating the outlier summary
# for all numeric columns
##################################
numeric_column_outlier_summary = pd.DataFrame(zip(numeric_variable_name_list,
numeric_outlier_count_list,
numeric_row_count_list,
numeric_outlier_ratio_list,
numeric_skewness_list,
numeric_kurtosis_list),
columns=['Numeric.Column.Name',
'Outlier.Count',
'Row.Count',
'Outlier.Ratio',
'Skewness',
'Kurtosis'])
display(numeric_column_outlier_summary)
Numeric.Column.Name | Outlier.Count | Row.Count | Outlier.Ratio | Skewness | Kurtosis | |
---|---|---|---|---|---|---|
0 | Age | 0 | 204 | 0.0 | 0.525218 | -0.494286 |
##################################
# Formulating the individual boxplots
# for all numeric columns
##################################
for column in thyroid_cancer_train_column_filtered_numeric:
plt.figure(figsize=(17,1))
sns.boxplot(data=thyroid_cancer_train_column_filtered_numeric, x=column)
1.4.5 Collinearity ¶
- The majority of predictor pairs reported low (<0.50) to moderate (0.50 to 0.75) correlation.
- Among pairwise combinations of categorical predictors, high Phi.Coefficient values were noted for the following pairs (a quick standalone check of the strongest pair is sketched after this list):
- N and Adenopathy: Phi.Coefficient = +0.805
- N and Risk: Phi.Coefficient = +0.726
- Adenopathy and Risk: Phi.Coefficient = +0.674
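- As referenced above, the Phi coefficient between two binary predictors equals the Pearson correlation of their 0/1 category codes. The snippet below is a minimal sketch for the N and Adenopathy pair using the column-filtered train data; the sign depends on how the levels are coded, while the magnitude does not.
##################################
# Minimal sketch: Phi coefficient for a binary predictor pair
# computed as the Pearson correlation of the 0/1 category codes
# (illustrated for the N and Adenopathy pair)
##################################
n_codes = thyroid_cancer_train_column_filtered['N'].astype('category').cat.codes
adenopathy_codes = thyroid_cancer_train_column_filtered['Adenopathy'].astype('category').cat.codes
phi_coefficient = n_codes.corr(adenopathy_codes)
print(f"Phi coefficient (N vs Adenopathy): {phi_coefficient:+.3f}")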
##################################
# Creating a dataset copy and
# converting all values to numeric
# for correlation analysis
##################################
pd.set_option('future.no_silent_downcasting', True)
thyroid_cancer_train_correlation = thyroid_cancer_train_column_filtered.copy()
thyroid_cancer_train_correlation_object = thyroid_cancer_train_correlation.iloc[:,1:13].columns
custom_category_orders = {
'Gender': ['M', 'F'],
'Smoking': ['No', 'Yes'],
'Thyroid_Function': ['Euthyroid', 'Hypothyroidism or Hyperthyroidism'],
'Physical_Examination': ['Normal or Single Nodular Goiter', 'Multinodular or Diffuse Goiter'],
'Adenopathy': ['No', 'Yes'],
'Pathology': ['Non-Papillary', 'Papillary'],
'Focality': ['Uni-Focal', 'Multi-Focal'],
'Risk': ['Low', 'Intermediate to High'],
'T': ['T1 to T2', 'T3 to T4b'],
'N': ['N0', 'N1'],
'Stage': ['I', 'II to IVB'],
'Response': ['Excellent', 'Indeterminate or Incomplete']
}
encoder = OrdinalEncoder(categories=[custom_category_orders[col] for col in thyroid_cancer_train_correlation_object])
thyroid_cancer_train_correlation[thyroid_cancer_train_correlation_object] = encoder.fit_transform(
thyroid_cancer_train_correlation[thyroid_cancer_train_correlation_object]
)
thyroid_cancer_train_correlation = thyroid_cancer_train_correlation.drop(['Recurred'], axis=1)
display(thyroid_cancer_train_correlation)
Age | Gender | Smoking | Thyroid_Function | Physical_Examination | Adenopathy | Pathology | Focality | Risk | T | N | Stage | Response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
140 | 28 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
205 | 36 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
277 | 41 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
294 | 42 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
268 | 32 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
300 | 67 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
115 | 37 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
67 | 51 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
161 | 22 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
55 | 21 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
204 rows × 13 columns
##################################
# Initializing an empty correlation matrix
##################################
thyroid_cancer_train_correlation_matrix = pd.DataFrame(
np.zeros((len(thyroid_cancer_train_correlation.columns), len(thyroid_cancer_train_correlation.columns))),
index=thyroid_cancer_train_correlation.columns,
columns=thyroid_cancer_train_correlation.columns
)
##################################
# Calculating different types
# of correlation coefficients
# per variable type
##################################
for i in range(len(thyroid_cancer_train_correlation.columns)):
for j in range(i, len(thyroid_cancer_train_correlation.columns)):
if i == j:
thyroid_cancer_train_correlation_matrix.iloc[i, j] = 1.0
else:
col_i = thyroid_cancer_train_correlation.iloc[:, i]
col_j = thyroid_cancer_train_correlation.iloc[:, j]
# Detecting binary variables (assumes binary variables are coded as 0/1)
is_binary_i = col_i.nunique() == 2
is_binary_j = col_j.nunique() == 2
# Computing the Pearson correlation for two continuous variables
if col_i.dtype in ['int64', 'float64'] and col_j.dtype in ['int64', 'float64']:
corr = col_i.corr(col_j)
# Computing the Point-Biserial correlation for continuous and binary variables
elif (col_i.dtype in ['int64', 'float64'] and is_binary_j) or (col_j.dtype in ['int64', 'float64'] and is_binary_i):
continuous_var = col_i if col_i.dtype in ['int64', 'float64'] else col_j
binary_var = col_j if is_binary_j else col_i
# Convert binary variable to 0/1 (if not already)
binary_var = binary_var.astype('category').cat.codes
corr, _ = pointbiserialr(continuous_var, binary_var)
# Computing the Phi coefficient for two binary variables
elif is_binary_i and is_binary_j:
corr = col_i.corr(col_j)
# Computing the Cramér's V for two categorical variables (if more than 2 categories)
else:
contingency_table = pd.crosstab(col_i, col_j)
chi2, _, _, _ = chi2_contingency(contingency_table)
n = contingency_table.sum().sum()
phi2 = chi2 / n
r, k = contingency_table.shape
corr = np.sqrt(phi2 / min(k - 1, r - 1)) # Cramér's V formula
# Assigning correlation values to the matrix
thyroid_cancer_train_correlation_matrix.iloc[i, j] = corr
thyroid_cancer_train_correlation_matrix.iloc[j, i] = corr
# Displaying the correlation matrix
display(thyroid_cancer_train_correlation_matrix)
Age | Gender | Smoking | Thyroid_Function | Physical_Examination | Adenopathy | Pathology | Focality | Risk | T | N | Stage | Response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Age | 1.000000 | -0.185530 | 0.299971 | 0.077845 | 0.012021 | 0.073931 | -0.215274 | 0.195272 | 0.205360 | 0.246838 | 0.013195 | 0.528144 | 0.317978 |
Gender | -0.185530 | 1.000000 | -0.604101 | -0.093290 | -0.031935 | -0.158480 | 0.127817 | -0.218103 | -0.255507 | -0.215101 | -0.178550 | -0.219727 | -0.179431 |
Smoking | 0.299971 | -0.604101 | 1.000000 | 0.064124 | 0.004339 | 0.192350 | -0.338086 | 0.182212 | 0.233024 | 0.231679 | 0.105463 | 0.327952 | 0.215362 |
Thyroid_Function | 0.077845 | -0.093290 | 0.064124 | 1.000000 | 0.019964 | -0.137486 | -0.049893 | 0.051564 | -0.012519 | -0.042960 | -0.043275 | 0.080702 | -0.036498 |
Physical_Examination | 0.012021 | -0.031935 | 0.004339 | 0.019964 | 1.000000 | 0.063246 | 0.018806 | 0.245779 | 0.166012 | 0.086039 | 0.104553 | 0.054799 | 0.116526 |
Adenopathy | 0.073931 | -0.158480 | 0.192350 | -0.137486 | 0.063246 | 1.000000 | 0.047117 | 0.288750 | 0.673638 | 0.421762 | 0.805406 | 0.278749 | 0.518887 |
Pathology | -0.215274 | 0.127817 | -0.338086 | -0.049893 | 0.018806 | 0.047117 | 1.000000 | -0.126299 | -0.117392 | -0.286899 | 0.157869 | -0.187683 | -0.154637 |
Focality | 0.195272 | -0.218103 | 0.182212 | 0.051564 | 0.245779 | 0.288750 | -0.126299 | 1.000000 | 0.454926 | 0.518864 | 0.307716 | 0.372331 | 0.388741 |
Risk | 0.205360 | -0.255507 | 0.233024 | -0.012519 | 0.166012 | 0.673638 | -0.117392 | 0.454926 | 1.000000 | 0.622459 | 0.726304 | 0.533264 | 0.631330 |
T | 0.246838 | -0.215101 | 0.231679 | -0.042960 | 0.086039 | 0.421762 | -0.286899 | 0.518864 | 0.622459 | 1.000000 | 0.368430 | 0.468168 | 0.556742 |
N | 0.013195 | -0.178550 | 0.105463 | -0.043275 | 0.104553 | 0.805406 | 0.157869 | 0.307716 | 0.726304 | 0.368430 | 1.000000 | 0.310156 | 0.542672 |
Stage | 0.528144 | -0.219727 | 0.327952 | 0.080702 | 0.054799 | 0.278749 | -0.187683 | 0.372331 | 0.533264 | 0.468168 | 0.310156 | 1.000000 | 0.417025 |
Response | 0.317978 | -0.179431 | 0.215362 | -0.036498 | 0.116526 | 0.518887 | -0.154637 | 0.388741 | 0.631330 | 0.556742 | 0.542672 | 0.417025 | 1.000000 |
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric and categorical columns
##################################
plt.figure(figsize=(17, 8))
sns.heatmap(thyroid_cancer_train_correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
1.5. Data Exploration ¶
1.5.1 Exploratory Data Analysis ¶
- Bivariate analysis identified individual predictors with a generally positive association with the target variable based on visual inspection.
- Higher values or higher proportions for the following predictors are associated with the Recurred=Yes category:
- Age
- Gender=M
- Smoking=Yes
- Physical_Examination=Multinodular or Diffuse Goiter
- Adenopathy=Yes
- Focality=Multi-Focal
- Risk=Intermediate to High
- T=T3 to T4b
- N=N1
- Stage=II to IVB
- Response=Indeterminate or Incomplete
- Proportions for the following predictors are not associated with the Recurred=Yes or Recurred=No categories:
- Thyroid_Function
- Pathology
##################################
# Segregating the target
# and predictor variables
##################################
thyroid_cancer_train_column_filtered_predictors = thyroid_cancer_train_column_filtered.iloc[:,:-1].columns
thyroid_cancer_train_column_filtered_predictors_numeric = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:, thyroid_cancer_train_column_filtered.iloc[:,:-1].columns == 'Age'].columns
thyroid_cancer_train_column_filtered_predictors_categorical = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:,thyroid_cancer_train_column_filtered.iloc[:,:-1].columns != 'Age'].columns
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = thyroid_cancer_train_column_filtered_predictors_numeric
##################################
# Segregating the target variable
# and numeric predictors
##################################
boxplot_y_variable = 'Recurred'
boxplot_x_variable = numeric_variable_name_list.values[0]
##################################
# Evaluating the numeric predictors
# against the target variable
##################################
plt.figure(figsize=(7, 5))
plt.boxplot([group[boxplot_x_variable] for name, group in thyroid_cancer_train_column_filtered.groupby(boxplot_y_variable, observed=True)])
plt.title(f'{boxplot_y_variable} Versus {boxplot_x_variable}')
plt.xlabel(boxplot_y_variable)
plt.ylabel(boxplot_x_variable)
plt.xticks(range(1, len(thyroid_cancer_train_column_filtered[boxplot_y_variable].unique()) + 1), ['No', 'Yes'])
plt.show()
##################################
# Segregating the target variable
# and categorical predictors
##################################
proportion_y_variables = thyroid_cancer_train_column_filtered_predictors_categorical
proportion_x_variable = 'Recurred'
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 4
num_cols = 3
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 20))
##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()
##################################
# Formulating the individual stacked column plots
# for all categorical columns
##################################
for i, y_variable in enumerate(proportion_y_variables):
ax = axes[i]
category_counts = thyroid_cancer_train_column_filtered.groupby([proportion_x_variable, y_variable], observed=True).size().unstack(fill_value=0)
category_proportions = category_counts.div(category_counts.sum(axis=1), axis=0)
category_proportions.plot(kind='bar', stacked=True, ax=ax)
ax.set_title(f'{proportion_x_variable} Versus {y_variable}')
ax.set_xlabel(proportion_x_variable)
ax.set_ylabel('Proportions')
ax.legend(loc="lower center")
##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()
##################################
# Presenting the subplots
##################################
plt.show()
1.5.2 Hypothesis Testing ¶
- The relationship between the numeric predictor and the Recurred target variable was statistically evaluated using the following hypotheses:
- Null: Difference in the means between groups Yes and No is equal to zero
- Alternative: Difference in the means between groups Yes and No is not equal to zero
- There is sufficient evidence to conclude that there is a statistically significant difference between the means of the numeric measurements obtained from the Yes and No groups of the Recurred target variable in 1 of 1 numeric predictor, given its high absolute t-test statistic and a p-value below the 0.05 significance level.
- Age: T.Test.Statistic=-3.748, T.Test.PValue=0.000
- The relationship between the categorical predictors and the Recurred target variable was statistically evaluated using the following hypotheses:
- Null: The categorical predictor is independent of the categorical target variable
- Alternative: The categorical predictor is dependent on the categorical target variable
- There is sufficient evidence to conclude that there is a statistically significant relationship between the categories of the categorical predictors and the Yes and No groups of the Recurred target variable in 9 of 12 categorical predictors, given their high chi-square statistics and p-values below the 0.05 significance level.
- Risk: ChiSquare.Test.Statistic=98.599, ChiSquare.Test.PValue=0.000
- Response: ChiSquare.Test.Statistic=90.866, ChiSquare.Test.PValue=0.000
- Adenopathy: ChiSquare.Test.Statistic=73.585, ChiSquare.Test.PValue=0.000
- N: ChiSquare.Test.Statistic=73.176, ChiSquare.Test.PValue=0.000
- T: ChiSquare.Test.Statistic=62.205, ChiSquare.Test.PValue=0.000
- Stage: ChiSquare.Test.Statistic=44.963, ChiSquare.Test.PValue=0.000
- Focality: ChiSquare.Test.Statistic=32.859, ChiSquare.Test.PValue=0.000
- Gender: ChiSquare.Test.Statistic=17.787, ChiSquare.Test.PValue=0.000
- Smoking: ChiSquare.Test.Statistic=14.460, ChiSquare.Test.PValue=0.001
- There is only marginal evidence of a relationship between the categories of the categorical predictor and the Yes and No groups of the Recurred target variable in 1 of 12 categorical predictors, given a chi-square statistic with a p-value near (slightly above) the 0.10 significance level.
- Physical_Examination: ChiSquare.Test.Statistic=2.413, ChiSquare.Test.PValue=0.120
##################################
# Computing the t-test
# statistic and p-values
# between the target variable
# and numeric predictor columns
##################################
thyroid_cancer_numeric_ttest_target = {}
thyroid_cancer_numeric = thyroid_cancer_train_column_filtered.loc[:,(thyroid_cancer_train_column_filtered.columns == 'Age') | (thyroid_cancer_train_column_filtered.columns == 'Recurred')]
thyroid_cancer_numeric_columns = thyroid_cancer_train_column_filtered_predictors_numeric
for numeric_column in thyroid_cancer_numeric_columns:
group_0 = thyroid_cancer_numeric[thyroid_cancer_numeric.loc[:,'Recurred']=='No']
group_1 = thyroid_cancer_numeric[thyroid_cancer_numeric.loc[:,'Recurred']=='Yes']
thyroid_cancer_numeric_ttest_target['Recurred_' + numeric_column] = stats.ttest_ind(
group_0[numeric_column],
group_1[numeric_column],
equal_var=True)
##################################
# Formulating the pairwise ttest summary
# between the target variable
# and numeric predictor columns
##################################
thyroid_cancer_numeric_summary = pd.DataFrame.from_dict(thyroid_cancer_numeric_ttest_target, orient='index')
thyroid_cancer_numeric_summary.columns = ['T.Test.Statistic', 'T.Test.PValue']
display(thyroid_cancer_numeric_summary.sort_values(by=['T.Test.PValue'], ascending=True).head(len(thyroid_cancer_train_column_filtered_predictors_numeric)))
T.Test.Statistic | T.Test.PValue | |
---|---|---|
Recurred_Age | -3.747942 | 0.000233 |
##################################
# Computing the chisquare
# statistic and p-values
# between the target variable
# and categorical predictor columns
##################################
thyroid_cancer_categorical_chisquare_target = {}
thyroid_cancer_categorical = thyroid_cancer_train_column_filtered.loc[:,(thyroid_cancer_train_column_filtered.columns != 'Age') | (thyroid_cancer_train_column_filtered.columns == 'Recurred')]
thyroid_cancer_categorical_columns = thyroid_cancer_train_column_filtered_predictors_categorical
for categorical_column in thyroid_cancer_categorical_columns:
contingency_table = pd.crosstab(thyroid_cancer_categorical[categorical_column],
thyroid_cancer_categorical['Recurred'])
thyroid_cancer_categorical_chisquare_target['Recurred_' + categorical_column] = stats.chi2_contingency(
contingency_table)[0:2]
##################################
# Formulating the pairwise chisquare summary
# between the target variable
# and categorical predictor columns
##################################
thyroid_cancer_categorical_summary = pd.DataFrame.from_dict(thyroid_cancer_categorical_chisquare_target, orient='index')
thyroid_cancer_categorical_summary.columns = ['ChiSquare.Test.Statistic', 'ChiSquare.Test.PValue']
display(thyroid_cancer_categorical_summary.sort_values(by=['ChiSquare.Test.PValue'], ascending=True).head(len(thyroid_cancer_train_column_filtered_predictors_categorical)))
ChiSquare.Test.Statistic | ChiSquare.Test.PValue | |
---|---|---|
Recurred_Risk | 98.599608 | 3.090804e-23 |
Recurred_Response | 90.866461 | 1.537030e-21 |
Recurred_Adenopathy | 73.585561 | 9.636704e-18 |
Recurred_N | 73.176134 | 1.185810e-17 |
Recurred_T | 62.205367 | 3.094435e-15 |
Recurred_Stage | 44.963917 | 2.006987e-11 |
Recurred_Focality | 32.859398 | 9.907099e-09 |
Recurred_Gender | 17.787641 | 2.469824e-05 |
Recurred_Smoking | 14.460357 | 1.431406e-04 |
Recurred_Physical_Examination | 2.413115 | 1.203227e-01 |
Recurred_Thyroid_Function | 0.966826 | 3.254729e-01 |
Recurred_Pathology | 0.131614 | 7.167646e-01 |
1.6. Premodelling Data Preparation ¶
1.6.1 Preprocessed Data Description ¶
- A total of 6 of the 16 predictors were excluded from the dataset based on the data preprocessing and exploration findings
- There were 3 categorical predictors excluded from the dataset after being observed with extremely low variance, with a single category dominating nearly all observations; such predictors offer limited predictive power and may increase model complexity without performance gains:
- Hx_Smoking:
- 193 Hx_Smoking=No: 94.61%
- 11 Hx_Smoking=Yes: 5.39%
- Hx_Radiotherapy:
- 202 Hx_Radiotherapy=No: 99.02%
- 2 Hx_Radiotherapy=Yes: 0.98%
- M:
- 194 M=M0: 95.10%
- 10 M=M1: 4.90%
- There was 1 categorical predictor (N) excluded from the dataset after being observed with high pairwise collinearity (Phi.Coefficient>0.70) with 2 other predictors; the redundant information it provides could lead to instability in regression-based models.
- N and Adenopathy: Phi.Coefficient = +0.805
- N and Risk: Phi.Coefficient = +0.726
- Another 2 categorical predictors were excluded from the dataset for not exhibiting a statistically significant association with the Yes and No groups of the Recurred target variable, indicating weak predictive value.
- Thyroid_Function: ChiSquare.Test.Statistic=0.967, ChiSquare.Test.PValue=0.325
- Pathology: ChiSquare.Test.Statistic=0.132, ChiSquare.Test.PValue=0.717
- The preprocessed train data (final) subset is comprised of:
- 204 rows (observations)
- 143 Recurred=No: 70.10%
- 61 Recurred=Yes: 29.90%
- 11 columns (variables)
- 1/11 target (categorical)
- Recurred
- 1/11 predictor (numeric)
- Age
- 9/11 predictor (categorical)
- Gender
- Smoking
- Physical_Examination
- Adenopathy
- Focality
- Risk
- T
- Stage
- Response
1.6.2 Preprocessing Pipeline Development ¶
- A preprocessing pipeline was formulated and applied to the train data (final), validation data and test data with the following actions:
- Excluded specified columns noted with low variance, high collinearity and weak predictive power
- Aggregated categories in multiclass categorical variables into binary levels
- Converted categorical columns to the appropriate type
- Set the order of category levels for ordinal encoding during modeling pipeline creation (a sketch of this downstream encoding step follows this list)
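- As noted in the last action above, the ordered category levels set by the preprocessing pipeline are meant to feed an ordinal encoder when the modeling pipelines are built later. The snippet below is a minimal, self-contained sketch of that downstream step using a toy frame and scikit-learn's OrdinalEncoder; it is an illustration under these assumptions, not the project's actual modeling pipeline.
##################################
# Minimal sketch: reusing preset ordered category levels
# for ordinal encoding in a downstream modeling pipeline
# (toy frame for illustration only)
##################################
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
toy_frame = pd.DataFrame({
    'Risk': pd.Categorical(['Low', 'Intermediate to High', 'Low'],
                           categories=['Low', 'Intermediate to High'], ordered=True),
    'Response': pd.Categorical(['Excellent', 'Indeterminate or Incomplete', 'Excellent'],
                               categories=['Excellent', 'Indeterminate or Incomplete'], ordered=True)
})
# Passing the preset category order so the encoded values follow the intended level ordering
toy_encoder = OrdinalEncoder(categories=[toy_frame[col].cat.categories.tolist() for col in toy_frame.columns])
print(toy_encoder.fit_transform(toy_frame))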
##################################
# Formulating a preprocessing pipeline
# that removes the specified columns,
# aggregates categories in multiclass categorical variables,
# converts categorical columns to the appropriate type, and
# sets the order of category levels
##################################
def preprocess_dataset(df):
# Removing the specified columns
columns_to_remove = ['Hx_Smoking', 'Hx_Radiotherapy', 'M', 'N', 'Thyroid_Function', 'Pathology']
df = df.drop(columns=columns_to_remove)
# Applying category aggregation
df['Physical_Examination'] = df['Physical_Examination'].map(
lambda x: 'Normal or Single Nodular Goiter' if x in ['Normal', 'Single nodular goiter-left', 'Single nodular goiter-right']
else 'Multinodular or Diffuse Goiter').astype('category')
df['Adenopathy'] = df['Adenopathy'].map(
lambda x: 'No' if x == 'No' else ('Yes' if pd.notna(x) and x != '' else x)).astype('category')
df['Risk'] = df['Risk'].map(
lambda x: 'Low' if x == 'Low' else 'Intermediate to High').astype('category')
df['T'] = df['T'].map(
lambda x: 'T1 to T2' if x in ['T1a', 'T1b', 'T2'] else 'T3 to T4b').astype('category')
df['Stage'] = df['Stage'].map(
lambda x: 'I' if x == 'I' else 'II to IVB').astype('category')
df['Response'] = df['Response'].map(
lambda x: 'Indeterminate or Incomplete' if x in ['Indeterminate', 'Structural Incomplete', 'Biochemical Incomplete']
else 'Excellent').astype('category')
# Setting category levels
category_mappings = {
'Gender': ['M', 'F'],
'Smoking': ['No', 'Yes'],
'Physical_Examination': ['Normal or Single Nodular Goiter', 'Multinodular or Diffuse Goiter'],
'Adenopathy': ['No', 'Yes'],
'Focality': ['Uni-Focal', 'Multi-Focal'],
'Risk': ['Low', 'Intermediate to High'],
'T': ['T1 to T2', 'T3 to T4b'],
'Stage': ['I', 'II to IVB'],
'Response': ['Excellent', 'Indeterminate or Incomplete']
}
for col, categories in category_mappings.items():
df[col] = df[col].astype('category')
df[col] = df[col].cat.set_categories(categories, ordered=True)
return df
##################################
# Applying the preprocessing pipeline
# to the train data
##################################
thyroid_cancer_preprocessed_train = preprocess_dataset(thyroid_cancer_train)
X_preprocessed_train = thyroid_cancer_preprocessed_train.drop('Recurred', axis = 1)
y_preprocessed_train = thyroid_cancer_preprocessed_train['Recurred']
thyroid_cancer_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_PATH, "thyroid_cancer_preprocessed_train.csv"), index=False)
X_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_FEATURES_PATH, "X_preprocessed_train.csv"), index=False)
y_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_TARGET_PATH, "y_preprocessed_train.csv"), index=False)
print('Final Preprocessed Train Dataset Dimensions: ')
display(X_preprocessed_train.shape)
display(y_preprocessed_train.shape)
print('Final Preprocessed Train Target Variable Breakdown: ')
display(y_preprocessed_train.value_counts())
print('Final Preprocessed Train Target Variable Proportion: ')
display(y_preprocessed_train.value_counts(normalize = True))
thyroid_cancer_preprocessed_train.head()
Final Preprocessed Train Dataset Dimensions:
(204, 10)
(204,)
Final Preprocessed Train Target Variable Breakdown:
Recurred
No     143
Yes     61
Name: count, dtype: int64
Final Preprocessed Train Target Variable Proportion:
Recurred
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
Age | Gender | Smoking | Physical_Examination | Adenopathy | Focality | Risk | T | Stage | Response | Recurred | |
---|---|---|---|---|---|---|---|---|---|---|---|
140 | 28 | F | No | Multinodular or Diffuse Goiter | No | Uni-Focal | Low | T1 to T2 | I | Excellent | No |
205 | 36 | F | No | Normal or Single Nodular Goiter | Yes | Uni-Focal | Low | T1 to T2 | I | Indeterminate or Incomplete | No |
277 | 41 | M | Yes | Normal or Single Nodular Goiter | No | Multi-Focal | Intermediate to High | T3 to T4b | I | Excellent | No |
294 | 42 | M | No | Normal or Single Nodular Goiter | No | Multi-Focal | Intermediate to High | T3 to T4b | I | Indeterminate or Incomplete | No |
268 | 32 | F | No | Normal or Single Nodular Goiter | No | Uni-Focal | Low | T3 to T4b | I | Excellent | No |
##################################
# Applying the preprocessing pipeline
# to the validation data
##################################
thyroid_cancer_preprocessed_validation = preprocess_dataset(thyroid_cancer_validation)
X_preprocessed_validation = thyroid_cancer_preprocessed_validation.drop('Recurred', axis = 1)
y_preprocessed_validation = thyroid_cancer_preprocessed_validation['Recurred']
thyroid_cancer_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_PATH, "thyroid_cancer_preprocessed_validation.csv"), index=False)
X_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_FEATURES_PATH, "X_preprocessed_validation.csv"), index=False)
y_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_TARGET_PATH, "y_preprocessed_validation.csv"), index=False)
print('Final Preprocessed Validation Dataset Dimensions: ')
display(X_preprocessed_validation.shape)
display(y_preprocessed_validation.shape)
print('Final Preprocessed Validation Target Variable Breakdown: ')
display(y_preprocessed_validation.value_counts())
print('Final Preprocessed Validation Target Variable Proportion: ')
display(y_preprocessed_validation.value_counts(normalize = True))
thyroid_cancer_preprocessed_validation.head()
Final Preprocessed Validation Dataset Dimensions:
(69, 10)
(69,)
Final Preprocessed Validation Target Variable Breakdown:
Recurred
No     49
Yes    20
Name: count, dtype: int64
Final Preprocessed Validation Target Variable Proportion:
Recurred
No     0.710145
Yes    0.289855
Name: proportion, dtype: float64
Age | Gender | Smoking | Physical_Examination | Adenopathy | Focality | Risk | T | Stage | Response | Recurred | |
---|---|---|---|---|---|---|---|---|---|---|---|
173 | 30 | F | No | Normal or Single Nodular Goiter | No | Uni-Focal | Low | T1 to T2 | I | Indeterminate or Incomplete | No |
164 | 29 | F | No | Normal or Single Nodular Goiter | No | Multi-Focal | Low | T1 to T2 | I | Excellent | No |
256 | 21 | M | Yes | Normal or Single Nodular Goiter | No | Uni-Focal | Low | T3 to T4b | I | Indeterminate or Incomplete | No |
348 | 58 | F | No | Multinodular or Diffuse Goiter | Yes | Multi-Focal | Intermediate to High | T3 to T4b | II to IVB | Indeterminate or Incomplete | Yes |
131 | 31 | F | No | Normal or Single Nodular Goiter | No | Uni-Focal | Low | T1 to T2 | I | Excellent | No |
##################################
# Applying the preprocessing pipeline
# to the test data
##################################
thyroid_cancer_preprocessed_test = preprocess_dataset(thyroid_cancer_test)
X_preprocessed_test = thyroid_cancer_preprocessed_test.drop('Recurred', axis = 1)
y_preprocessed_test = thyroid_cancer_preprocessed_test['Recurred']
thyroid_cancer_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_PATH, "thyroid_cancer_preprocessed_test.csv"), index=False)
X_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_FEATURES_PATH, "X_preprocessed_test.csv"), index=False)
y_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_TARGET_PATH, "y_preprocessed_test.csv"), index=False)
print('Final Preprocessed Test Dataset Dimensions: ')
display(X_preprocessed_test.shape)
display(y_preprocessed_test.shape)
print('Final Preprocessed Test Target Variable Breakdown: ')
display(y_preprocessed_test.value_counts())
print('Final Preprocessed Test Target Variable Proportion: ')
display(y_preprocessed_test.value_counts(normalize = True))
thyroid_cancer_preprocessed_test.head()
Final Preprocessed Test Dataset Dimensions:
(91, 10)
(91,)
Final Preprocessed Test Target Variable Breakdown:
Recurred
No     64
Yes    27
Name: count, dtype: int64
Final Preprocessed Test Target Variable Proportion:
Recurred
No     0.703297
Yes    0.296703
Name: proportion, dtype: float64
Age | Gender | Smoking | Physical_Examination | Adenopathy | Focality | Risk | T | Stage | Response | Recurred | |
---|---|---|---|---|---|---|---|---|---|---|---|
345 | 25 | F | No | Multinodular or Diffuse Goiter | Yes | Multi-Focal | Intermediate to High | T3 to T4b | I | Indeterminate or Incomplete | Yes |
249 | 46 | F | No | Normal or Single Nodular Goiter | No | Multi-Focal | Low | T3 to T4b | I | Excellent | No |
83 | 40 | F | No | Normal or Single Nodular Goiter | No | Uni-Focal | Intermediate to High | T1 to T2 | I | Excellent | No |
184 | 67 | F | No | Normal or Single Nodular Goiter | No | Uni-Focal | Low | T1 to T2 | I | Excellent | No |
146 | 25 | F | No | Multinodular or Diffuse Goiter | No | Uni-Focal | Low | T1 to T2 | I | Indeterminate or Incomplete | No |
##################################
# Defining a function to compute
# model performance
##################################
def model_performance_evaluation(y_true, y_pred):
metric_name = ['Accuracy','Precision','Recall','F1','AUROC']
metric_value = [accuracy_score(y_true, y_pred),
precision_score(y_true, y_pred),
recall_score(y_true, y_pred),
f1_score(y_true, y_pred),
roc_auc_score(y_true, y_pred)]
metric_summary = pd.DataFrame(zip(metric_name, metric_value),
columns=['metric_name','metric_value'])
return(metric_summary)
1.7. Bagged Model Development ¶
Bagging (Bootstrap Aggregating) is an ensemble learning technique that reduces model variance by training multiple instances of the same algorithm on different randomly sampled subsets of the training data. The fundamental problem bagging aims to solve is overfitting, particularly in high-variance models. By generating multiple bootstrap samples—random subsets created through sampling with replacement — bagging ensures that each model is trained on slightly different data, making the overall prediction more stable. In classification problems, the final output is obtained by majority voting among the individual models, while in regression, their predictions are averaged. Bagging is particularly effective when dealing with noisy datasets, as it smooths out individual model errors. However, its effectiveness is limited for low-variance models, and the requirement to train multiple models increases computational cost.
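To make the mechanism concrete, the minimal sketch below builds a bagging ensemble by hand on synthetic data: bootstrap resampling, one decision tree per resample, and a majority vote. The data and variable names are hypothetical and purely illustrative; the actual project models rely on the scikit-learn ensemble implementations applied in the following sections.
##################################
# Illustrative sketch only (not part of the project pipeline):
# bagging by hand via bootstrap resampling and majority voting
##################################
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Bootstrap sample: draw rows with replacement from the training data
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))
# Majority voting: average the 0/1 predictions across trees and threshold at 0.5
votes = np.mean([tree.predict(X_te) for tree in trees], axis=0)
bagged_prediction = (votes >= 0.5).astype(int)
print("Bagged accuracy on the held-out demo split:", (bagged_prediction == y_te).mean())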
1.7.1 Random Forest ¶
Random Forest is an ensemble learning method that builds multiple decision trees and combines their outputs to improve prediction accuracy and robustness in binary classification. Instead of relying on a single decision tree, it aggregates multiple trees, reducing overfitting and increasing generalizability. The algorithm works by training individual decision trees on bootstrapped samples of the dataset, where each tree is trained on a slightly different subset of data. Additionally, at each decision node, a random subset of features is considered for splitting, adding further diversity among the trees. The final classification is determined by majority voting across all trees. The main advantages of Random Forest include its resilience to overfitting, ability to handle high-dimensional data, and robustness against noisy data. However, it has limitations, such as higher computational cost due to multiple trees and reduced interpretability compared to a single decision tree. It can also struggle with highly imbalanced data unless additional techniques like class weighting are applied.
- The random forest model from the sklearn.ensemble Python library API was implemented.
- The model contains 4 hyperparameters for tuning:
- criterion = function to measure the quality of a split made to vary between gini and entropy
- max_depth = maximum depth of the tree made to vary between 3 and 6
- min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
- n_estimators = number of base estimators in the ensemble made to vary between 100 and 200
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories (a worked sketch of the implied class weights is shown after this list).
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- criterion = entropy
- max_depth = 6
- min_samples_leaf = 10
- n_estimators = 200
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.8921
- Precision = 0.7746
- Recall = 0.9016
- F1 Score = 0.8333
- AUROC = 0.8948
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8985
- Precision = 0.7826
- Recall = 0.9000
- F1 Score = 0.8372
- AUROC = 0.8989
- The apparent and independent validation model performance metrics were sufficiently comparable, suggesting the absence of excessive model overfitting.
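As referenced in the class_weight note above, the short sketch below recomputes what class_weight = balanced resolves to for the final preprocessed train data (143 No versus 61 Yes cases), using the documented scikit-learn rule n_samples / (n_classes * class_count); the label array is reconstructed here only for illustration.
##################################
# Illustrative sketch only: deriving the class weights implied by
# class_weight='balanced' for the 143 No / 61 Yes train split
##################################
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
# Hypothetical encoded labels reproducing the observed class counts
y_demo = np.array([0] * 143 + [1] * 61)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], np.round(weights, 3))))
# Approximately {0: 0.713, 1: 1.672}, so a misclassified minority 'Yes' case
# is weighted roughly 2.3 times more heavily than a majority 'No' case during training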
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_rf_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('bagged_rf_model', RandomForestClassifier(class_weight='balanced',
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
bagged_rf_hyperparameter_grid = {
'bagged_rf_model__criterion': ['gini', 'entropy'],
'bagged_rf_model__max_depth': [3, 6],
'bagged_rf_model__min_samples_leaf': [5, 10],
'bagged_rf_model__n_estimators': [100, 200]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
bagged_rf_grid_search = GridSearchCV(
estimator=bagged_rf_pipeline,
param_grid=bagged_rf_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
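# Note: OrdinalEncoder assigns integer codes in sorted category order,
# so 'No' is encoded as 0 and 'Yes' as 1, making 'Yes' the positive class
# in the F1, precision, and recall computations that follow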
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
bagged_rf_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('bagged_rf_model', RandomForestClassifier(class_weight='balanced', random_state=987654321))]), n_jobs=-1, param_grid={'bagged_rf_model__criterion': ['gini', 'entropy'], 'bagged_rf_model__max_depth': [3, 6], 'bagged_rf_model__min_samples_leaf': [5, 10], 'bagged_rf_model__n_estimators': [100, 200]}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
bagged_rf_optimal = bagged_rf_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_rf_optimal_f1_cv = bagged_rf_grid_search.best_score_
bagged_rf_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train))
bagged_rf_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model - Random Forest: ')
print(f"Best Random Forest Hyperparameters: {bagged_rf_grid_search.best_params_}")
Best Bagged Model - Random Forest:
Best Random Forest Hyperparameters: {'bagged_rf_model__criterion': 'entropy', 'bagged_rf_model__max_depth': 6, 'bagged_rf_model__min_samples_leaf': 10, 'bagged_rf_model__n_estimators': 200}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_rf_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_rf_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8218
F1 Score on Training Data: 0.8333

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.89      0.92       143
         1.0       0.77      0.90      0.83        61

    accuracy                           0.89       204
   macro avg       0.86      0.89      0.88       204
weighted avg       0.90      0.89      0.89       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Random Forest Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Random Forest Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_rf_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8372

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.90      0.93        49
         1.0       0.78      0.90      0.84        20

    accuracy                           0.90        69
   macro avg       0.87      0.90      0.88        69
weighted avg       0.91      0.90      0.90        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Random Forest Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Random Forest Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
bagged_rf_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train))
bagged_rf_optimal_train['model'] = ['bagged_rf_optimal'] * 5
bagged_rf_optimal_train['set'] = ['train'] * 5
print('Optimal Random Forest Train Performance Metrics: ')
display(bagged_rf_optimal_train)
Optimal Random Forest Train Performance Metrics:
metric_name | metric_value | model | set | |
---|---|---|---|---|
0 | Accuracy | 0.892157 | bagged_rf_optimal | train |
1 | Precision | 0.774648 | bagged_rf_optimal | train |
2 | Recall | 0.901639 | bagged_rf_optimal | train |
3 | F1 | 0.833333 | bagged_rf_optimal | train |
4 | AUROC | 0.894876 | bagged_rf_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
bagged_rf_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation))
bagged_rf_optimal_validation['model'] = ['bagged_rf_optimal'] * 5
bagged_rf_optimal_validation['set'] = ['validation'] * 5
print('Optimal Random Forest Validation Performance Metrics: ')
display(bagged_rf_optimal_validation)
Optimal Random Forest Validation Performance Metrics:
metric_name | metric_value | model | set | |
---|---|---|---|---|
0 | Accuracy | 0.898551 | bagged_rf_optimal | validation |
1 | Precision | 0.782609 | bagged_rf_optimal | validation |
2 | Recall | 0.900000 | bagged_rf_optimal | validation |
3 | F1 | 0.837209 | bagged_rf_optimal | validation |
4 | AUROC | 0.898980 | bagged_rf_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(bagged_rf_optimal,
os.path.join("..", MODELS_PATH, "bagged_model_random_forest_optimal.pkl"))
['..\\models\\bagged_model_random_forest_optimal.pkl']
1.7.2 Extra Trees ¶
Extra Trees (Extremely Randomized Trees) is a variation of Random Forest that introduces more randomness into tree construction to improve generalization. Like Random Forest, it builds an ensemble of decision trees, but by default each tree is trained on the full training set rather than a bootstrap sample, and it differs in how splits are determined—rather than searching for the best split based on information gain or Gini impurity, Extra Trees draws random split thresholds for the candidate features at each node and keeps the best among these random splits. This extra randomness can prevent overfitting and make the model more robust to small variations in data. The key advantages of Extra Trees include its speed, as it does not need to search for the best split at each node, and its ability to handle large datasets efficiently. However, since it relies on random splits, it may not perform as well as Random Forest on some datasets, especially when strong feature interactions exist. Additionally, its randomness can make the model harder to interpret and tune effectively.
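To make the contrast with Random Forest concrete, the brief sketch below prints the relevant scikit-learn defaults of the two classifiers; it is illustrative only and not part of the modeling pipeline, and the defaults cited reflect recent scikit-learn releases.
##################################
# Illustrative sketch only: contrasting Random Forest and Extra Trees defaults
##################################
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
for model in (RandomForestClassifier(), ExtraTreesClassifier()):
    params = model.get_params()
    print(type(model).__name__,
          "| bootstrap:", params["bootstrap"],
          "| max_features:", params["max_features"])
# Random Forest draws bootstrap samples by default (bootstrap=True), while Extra Trees
# trains each tree on the full dataset (bootstrap=False) and injects randomness
# through random split thresholds instead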
- The extra trees model from the sklearn.ensemble Python library API was implemented.
- The model contains 4 hyperparameters for tuning:
- criterion = function to measure the quality of a split made to vary between gini and entropy
- max_depth = maximum depth of the tree made to vary between 3 and 6
- min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
- n_estimators = number of base estimators in the ensemble made to vary between 100 and 200
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- criterion = entropy
- max_depth = 6
- min_samples_leaf = 10
- n_estimators = 200
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.8921
- Precision = 0.7746
- Recall = 0.9016
- F1 Score = 0.8333
- AUROC = 0.8948
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8985
- Precision = 0.7826
- Recall = 0.9000
- F1 Score = 0.8372
- AUROC = 0.8989
- The apparent and independent validation model performance metrics were sufficiently comparable, suggesting the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_et_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('bagged_et_model', ExtraTreesClassifier(class_weight='balanced',
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
bagged_et_hyperparameter_grid = {
'bagged_et_model__criterion': ['gini', 'entropy'],
'bagged_et_model__max_depth': [3, 6],
'bagged_et_model__min_samples_leaf': [5, 10],
'bagged_et_model__n_estimators': [100, 200]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
bagged_et_grid_search = GridSearchCV(
estimator=bagged_et_pipeline,
param_grid=bagged_et_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
bagged_et_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('bagged_et_model', ExtraTreesClassifier(class_weight='balanced', random_state=987654321))]), n_jobs=-1, param_grid={'bagged_et_model__criterion': ['gini', 'entropy'], 'bagged_et_model__max_depth': [3, 6], 'bagged_et_model__min_samples_leaf': [5, 10], 'bagged_et_model__n_estimators': [100, 200]}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
bagged_et_optimal = bagged_et_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_et_optimal_f1_cv = bagged_et_grid_search.best_score_
bagged_et_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train))
bagged_et_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model – Extra Trees: ')
print(f"Best Extra Trees Hyperparameters: {bagged_et_grid_search.best_params_}")
Best Bagged Model – Extra Trees:
Best Extra Trees Hyperparameters: {'bagged_et_model__criterion': 'entropy', 'bagged_et_model__max_depth': 6, 'bagged_et_model__min_samples_leaf': 10, 'bagged_et_model__n_estimators': 200}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_et_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_et_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8101
F1 Score on Training Data: 0.8333

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.89      0.92       143
         1.0       0.77      0.90      0.83        61

    accuracy                           0.89       204
   macro avg       0.86      0.89      0.88       204
weighted avg       0.90      0.89      0.89       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Extra Trees Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Extra Trees Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_et_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8372

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.90      0.93        49
         1.0       0.78      0.90      0.84        20

    accuracy                           0.90        69
   macro avg       0.87      0.90      0.88        69
weighted avg       0.91      0.90      0.90        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Extra Trees Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Extra Trees Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
bagged_et_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train))
bagged_et_optimal_train['model'] = ['bagged_et_optimal'] * 5
bagged_et_optimal_train['set'] = ['train'] * 5
print('Optimal Extra Trees Train Performance Metrics: ')
display(bagged_et_optimal_train)
Optimal Extra Trees Train Performance Metrics:
metric_name | metric_value | model | set | |
---|---|---|---|---|
0 | Accuracy | 0.892157 | bagged_et_optimal | train |
1 | Precision | 0.774648 | bagged_et_optimal | train |
2 | Recall | 0.901639 | bagged_et_optimal | train |
3 | F1 | 0.833333 | bagged_et_optimal | train |
4 | AUROC | 0.894876 | bagged_et_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
bagged_et_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation))
bagged_et_optimal_validation['model'] = ['bagged_et_optimal'] * 5
bagged_et_optimal_validation['set'] = ['validation'] * 5
print('Optimal Extra Trees Validation Performance Metrics: ')
display(bagged_et_optimal_validation)
Optimal Extra Trees Validation Performance Metrics:
metric_name | metric_value | model | set | |
---|---|---|---|---|
0 | Accuracy | 0.898551 | bagged_et_optimal | validation |
1 | Precision | 0.782609 | bagged_et_optimal | validation |
2 | Recall | 0.900000 | bagged_et_optimal | validation |
3 | F1 | 0.837209 | bagged_et_optimal | validation |
4 | AUROC | 0.898980 | bagged_et_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(bagged_et_optimal,
os.path.join("..", MODELS_PATH, "bagged_model_extra_trees_optimal.pkl"))
['..\\models\\bagged_model_extra_trees_optimal.pkl']
1.7.3 Bagged Decision Trees ¶
Bagged Decision Trees is an ensemble method that reduces overfitting by training multiple decision trees on different bootstrap samples and aggregating their predictions. Unlike Random Forest, all features are considered when finding the best split at each node, making it less random but still improving stability compared to a single decision tree. The process involves drawing multiple random subsets of the training data (with replacement), training a decision tree on each subset, and combining the predictions using majority voting for classification. This technique helps to reduce variance and prevent overfitting, leading to more stable and accurate predictions. The main advantage of Bagged Decision Trees is that they perform well on complex datasets without requiring deep tuning. However, the downside is that they require significant computational power and memory, as multiple trees must be trained and stored. Additionally, unlike boosting methods, bagging does not inherently improve bias, meaning the performance is still dependent on the base decision tree's predictive power.
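One practical by-product of the bootstrap sampling described above is that each tree leaves out roughly a third of the training rows, which can be reused for an out-of-bag estimate of generalization. The short sketch below demonstrates this option on synthetic data as an aside; it is not part of the tuning pipeline that follows.
##################################
# Illustrative sketch only: out-of-bag (OOB) evaluation with bagged decision trees,
# reusing the rows each bootstrap sample leaves out as a built-in validation set
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
bagged_demo = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=6),
                                n_estimators=100,
                                oob_score=True,
                                random_state=0).fit(X_demo, y_demo)
print("Out-of-bag accuracy estimate:", round(bagged_demo.oob_score_, 4))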
- The bagging classifier and decision tree models from the sklearn.ensemble and sklearn.tree Python library APIs were implemented.
- The model contains 4 hyperparameters for tuning:
- criterion = function to measure the quality of a split made to vary between gini and entropy
- max_depth = maximum depth of the tree made to vary between 3 and 6
- min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
- n_estimators = number of base estimators in the ensemble made to vary between 100 and 200
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- criterion = gini
- max_depth = 6
- min_samples_leaf = 5
- n_estimators = 200
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9019
- Precision = 0.7971
- Recall = 0.9016
- F1 Score = 0.8461
- AUROC = 0.9018
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.9130
- Precision = 0.8181
- Recall = 0.9000
- F1 Score = 0.8571
- AUROC = 0.9091
- The apparent and independent validation model performance metrics were sufficiently comparable, suggesting the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_bdt_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('bagged_bdt_model', BaggingClassifier(estimator=DecisionTreeClassifier(class_weight='balanced',
random_state=987654321),
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
bagged_bdt_hyperparameter_grid = {
'bagged_bdt_model__estimator__criterion': ['gini', 'entropy'],
'bagged_bdt_model__estimator__max_depth': [3, 6],
'bagged_bdt_model__estimator__min_samples_leaf': [5, 10],
'bagged_bdt_model__n_estimators': [100, 200]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
bagged_bdt_grid_search = GridSearchCV(
estimator=bagged_bdt_pipeline,
param_grid=bagged_bdt_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
bagged_bdt_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])... BaggingClassifier(estimator=DecisionTreeClassifier(class_weight='balanced', random_state=987654321), random_state=987654321))]), n_jobs=-1, param_grid={'bagged_bdt_model__estimator__criterion': ['gini', 'entropy'], 'bagged_bdt_model__estimator__max_depth': [3, 6], 'bagged_bdt_model__estimator__min_samples_leaf': [5, 10], 'bagged_bdt_model__n_estimators': [100, 200]}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
bagged_bdt_optimal = bagged_bdt_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_bdt_optimal_f1_cv = bagged_bdt_grid_search.best_score_
bagged_bdt_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train))
bagged_bdt_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model – Bagged Decision Trees: ')
print(f"Best Bagged Decision Trees Hyperparameters: {bagged_bdt_grid_search.best_params_}")
Best Bagged Model – Bagged Decision Trees:
Best Bagged Decision Trees Hyperparameters: {'bagged_bdt_model__estimator__criterion': 'gini', 'bagged_bdt_model__estimator__max_depth': 6, 'bagged_bdt_model__estimator__min_samples_leaf': 5, 'bagged_bdt_model__n_estimators': 200}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_bdt_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_bdt_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8287
F1 Score on Training Data: 0.8462

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.90      0.93       143
         1.0       0.80      0.90      0.85        61

    accuracy                           0.90       204
   macro avg       0.88      0.90      0.89       204
weighted avg       0.91      0.90      0.90       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Decision Trees Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Decision Trees Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_bdt_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Decision Trees Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Decision Trees Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
bagged_bdt_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train))
bagged_bdt_optimal_train['model'] = ['bagged_bdt_optimal'] * 5
bagged_bdt_optimal_train['set'] = ['train'] * 5
print('Optimal Bagged Decision Trees Train Performance Metrics: ')
display(bagged_bdt_optimal_train)
Optimal Bagged Decision Trees Train Performance Metrics:
metric_name | metric_value | model | set | |
---|---|---|---|---|
0 | Accuracy | 0.901961 | bagged_bdt_optimal | train |
1 | Precision | 0.797101 | bagged_bdt_optimal | train |
2 | Recall | 0.901639 | bagged_bdt_optimal | train |
3 | F1 | 0.846154 | bagged_bdt_optimal | train |
4 | AUROC | 0.901869 | bagged_bdt_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
bagged_bdt_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation))
bagged_bdt_optimal_validation['model'] = ['bagged_bdt_optimal'] * 5
bagged_bdt_optimal_validation['set'] = ['validation'] * 5
print('Optimal Bagged Decision Trees Validation Performance Metrics: ')
display(bagged_bdt_optimal_validation)
Optimal Bagged Decision Trees Validation Performance Metrics:
metric_name | metric_value | model | set | |
---|---|---|---|---|
0 | Accuracy | 0.913043 | bagged_bdt_optimal | validation |
1 | Precision | 0.818182 | bagged_bdt_optimal | validation |
2 | Recall | 0.900000 | bagged_bdt_optimal | validation |
3 | F1 | 0.857143 | bagged_bdt_optimal | validation |
4 | AUROC | 0.909184 | bagged_bdt_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(bagged_bdt_optimal,
os.path.join("..", MODELS_PATH, "bagged_model_bagged_decision_trees_optimal.pkl"))
['..\\models\\bagged_model_bagged_decision_trees_optimal.pkl']
1.7.4 Bagged Logistic Regression ¶
Bagged Logistic Regression applies bootstrap aggregation (bagging) to logistic regression, improving its stability and generalization. Logistic regression is inherently a high-bias model, meaning it can underperform on complex, non-linear data. Bagging helps by training multiple logistic regression models on different bootstrap samples and averaging their probability outputs for final classification. This reduces variance and improves robustness, especially when dealing with small datasets prone to fluctuations. The main advantage is that it stabilizes logistic regression by reducing overfitting without adding significant complexity. Additionally, it works well when the relationship between features and the target variable is approximately linear. However, since logistic regression is a weak learner, bagging does not dramatically boost performance on highly non-linear problems. It is also computationally expensive compared to a single logistic regression model, and unlike boosting, it does not correct the inherent bias of logistic regression.
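To illustrate the probability averaging mentioned above, the sketch below fits a small bagged logistic regression on synthetic data and checks that the ensemble's predicted probabilities closely match the mean of the per-model probabilities; the data and variable names are hypothetical, and the comparison reflects how scikit-learn's BaggingClassifier aggregates base estimators that expose predict_proba.
##################################
# Illustrative sketch only: bagged logistic regression averages the predicted
# probabilities of its bootstrap-trained base models
##################################
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
bagged_lr_demo = BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                                   n_estimators=50,
                                   random_state=0).fit(X_demo, y_demo)
# Manual average across base models (using each model's assigned feature indices)
manual_proba = np.mean(
    [lr.predict_proba(X_demo[:, feats])[:, 1]
     for lr, feats in zip(bagged_lr_demo.estimators_, bagged_lr_demo.estimators_features_)],
    axis=0)
ensemble_proba = bagged_lr_demo.predict_proba(X_demo)[:, 1]
print("Maximum difference between ensemble and manual averaging:",
      float(np.max(np.abs(ensemble_proba - manual_proba))))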
- The bagging classifier and logistic regression models from the sklearn.ensemble and sklearn.linear_model Python library APIs were implemented.
- The model contains 4 hyperparameters for tuning:
- C = inverse of regularization strength made to vary between 0.1 and 1.0
- penalty = penalty norm made to vary between l1 and l2
- solver = algorithm used in the optimization problem made to vary between liblinear and saga
- n_estimators = number of base estimators in the ensemble made to vary between 100 and 200
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- C = 1.0
- penalty = l1
- solver = liblinear
- n_estimators = 200
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.8921
- Precision = 0.7746
- Recall = 0.9016
- F1 Score = 0.8333
- AUROC = 0.8948
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8985
- Precision = 0.7826
- Recall = 0.9000
- F1 Score = 0.8372
- AUROC = 0.8989
- The apparent and independent validation model performance metrics were sufficiently comparable, suggesting the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_blr_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('bagged_blr_model', BaggingClassifier(estimator=LogisticRegression(class_weight='balanced',
random_state=987654321),
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
bagged_blr_hyperparameter_grid = {
'bagged_blr_model__estimator__C': [0.1, 1.0],
'bagged_blr_model__estimator__penalty': ['l1', 'l2'],
'bagged_blr_model__estimator__solver': ['liblinear', 'saga'],
'bagged_blr_model__n_estimators': [100, 200]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
bagged_blr_grid_search = GridSearchCV(
estimator=bagged_blr_pipeline,
param_grid=bagged_blr_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
bagged_blr_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])... BaggingClassifier(estimator=LogisticRegression(class_weight='balanced', random_state=987654321), random_state=987654321))]), n_jobs=-1, param_grid={'bagged_blr_model__estimator__C': [0.1, 1.0], 'bagged_blr_model__estimator__penalty': ['l1', 'l2'], 'bagged_blr_model__estimator__solver': ['liblinear', 'saga'], 'bagged_blr_model__n_estimators': [100, 200]}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
bagged_blr_optimal = bagged_blr_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_blr_optimal_f1_cv = bagged_blr_grid_search.best_score_
bagged_blr_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train))
bagged_blr_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model – Bagged Logistic Regression: ')
print(f"Best Bagged Logistic Regression Hyperparameters: {bagged_blr_grid_search.best_params_}")
Best Bagged Model – Bagged Logistic Regression:
Best Bagged Logistic Regression Hyperparameters: {'bagged_blr_model__estimator__C': 1.0, 'bagged_blr_model__estimator__penalty': 'l1', 'bagged_blr_model__estimator__solver': 'liblinear', 'bagged_blr_model__n_estimators': 200}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_blr_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_blr_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8213
F1 Score on Training Data: 0.8333

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0        0.95      0.89      0.92       143
         1.0        0.77      0.90      0.83        61

    accuracy                            0.89       204
   macro avg        0.86      0.89      0.88       204
weighted avg        0.90      0.89      0.89       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Logistic Regression Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Logistic Regression Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_blr_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8372

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0        0.96      0.90      0.93        49
         1.0        0.78      0.90      0.84        20

    accuracy                            0.90        69
   macro avg        0.87      0.90      0.88        69
weighted avg        0.91      0.90      0.90        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Logistic Regression Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Logistic Regression Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
bagged_blr_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train))
bagged_blr_optimal_train['model'] = ['bagged_blr_optimal'] * 5
bagged_blr_optimal_train['set'] = ['train'] * 5
print('Optimal Bagged Logistic Regression Train Performance Metrics: ')
display(bagged_blr_optimal_train)
Optimal Bagged Logistic Regression Train Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.892157 | bagged_blr_optimal | train |
1 | Precision | 0.774648 | bagged_blr_optimal | train |
2 | Recall | 0.901639 | bagged_blr_optimal | train |
3 | F1 | 0.833333 | bagged_blr_optimal | train |
4 | AUROC | 0.894876 | bagged_blr_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
bagged_blr_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation))
bagged_blr_optimal_validation['model'] = ['bagged_blr_optimal'] * 5
bagged_blr_optimal_validation['set'] = ['validation'] * 5
print('Optimal Bagged Logistic Regression Validation Performance Metrics: ')
display(bagged_blr_optimal_validation)
Optimal Bagged Logistic Regression Validation Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.898551 | bagged_blr_optimal | validation |
1 | Precision | 0.782609 | bagged_blr_optimal | validation |
2 | Recall | 0.900000 | bagged_blr_optimal | validation |
3 | F1 | 0.837209 | bagged_blr_optimal | validation |
4 | AUROC | 0.898980 | bagged_blr_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(bagged_blr_optimal,
os.path.join("..", MODELS_PATH, "bagged_model_bagged_logistic_regression_optimal.pkl"))
['..\\models\\bagged_model_bagged_logistic_regression_optimal.pkl']
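As a quick check that the persisted artifact can be reused downstream, a saved pipeline like this one can be reloaded with joblib and applied directly to new observations. The snippet below is a minimal illustrative sketch rather than part of the tuning workflow; it assumes MODELS_PATH and X_preprocessed_validation are available as defined in the preceding cells.
##################################
# Illustrative sketch only:
# reloading the saved bagged logistic regression
# pipeline and generating predictions
##################################
import os
import joblib

# Load the persisted pipeline (categorical preprocessing + bagged ensemble) from disk
bagged_blr_reloaded = joblib.load(
    os.path.join("..", MODELS_PATH, "bagged_model_bagged_logistic_regression_optimal.pkl"))

# The reloaded pipeline applies the same encoding and bagged model internally
reloaded_predictions = bagged_blr_reloaded.predict(X_preprocessed_validation)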
1.7.5 Bagged Support Vector Machine ¶
Bagged Support Vector Machine is an ensemble method that applies bagging to multiple SVM classifiers trained on different bootstrap samples, reducing variance while maintaining SVM's strong classification capabilities. SVM works by finding an optimal decision boundary (hyperplane) that maximizes the margin between different classes. However, a single SVM can be sensitive to small changes in data, especially when working with noisy datasets. By training multiple SVM models on different subsets and aggregating their predictions (majority voting), bagging stabilizes the decision boundary and enhances robustness. This approach is particularly useful when dealing with high-dimensional datasets with complex relationships. The key advantages include improved generalization, reduced overfitting, and better handling of noisy data. However, SVM is computationally intensive, and bagging increases the overall training time significantly, especially for large datasets. Additionally, combining multiple SVM models makes interpretation difficult, and performance gains may not always justify the added computational cost.
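To make the mechanism concrete before the tuned implementation below, the following minimal sketch bags linear SVMs on synthetic data and aggregates their votes; the dataset, estimator settings, and variable names here are illustrative assumptions and are separate from the project pipeline.
##################################
# Illustrative sketch only (synthetic data):
# bagging SVM classifiers trained on bootstrap
# samples with majority-vote aggregation
##################################
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Synthetic binary classification data (not the project dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Each SVC is fit on a different bootstrap sample;
# class predictions are combined by majority voting
bagged_svm_demo = BaggingClassifier(estimator=SVC(kernel='linear', class_weight='balanced'),
                                    n_estimators=25,
                                    bootstrap=True,
                                    random_state=0)
bagged_svm_demo.fit(X_demo, y_demo)
print(bagged_svm_demo.predict(X_demo[:5]))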
- The bagging classifier and support vector machine models from the sklearn.ensemble and sklearn.svm Python library APIs were implemented.
- The model contains 4 hyperparameters for tuning:
- C = regularization parameter (inversely proportional to regularization strength) made to vary between 0.1 and 1.0
- kernel = kernel type to be used in the algorithm made to vary between linear and rbf
- gamma = kernel coefficient made to vary between scale and auto
- n_estimators = number of base estimators in the ensemble made to vary between 100 and 200
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method, with the optimal model performance (F1 score) determined for:
- C = 1.0
- kernel = linear
- gamma = scale
- n_estimators = 100
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9068
- Precision = 0.8088
- Recall = 0.9016
- F1 Score = 0.8527
- AUROC = 0.9053
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.9130
- Precision = 0.8181
- Recall = 0.9000
- F1 Score = 0.8571
- AUROC = 0.9091
- Sufficiently comparable apparent and independent validation model performance was observed, which might be indicative of the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_bsvm_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('bagged_bsvm_model', BaggingClassifier(estimator=SVC(class_weight='balanced',
random_state=987654321),
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
bagged_bsvm_hyperparameter_grid = {
'bagged_bsvm_model__estimator__C': [0.1, 1.0],
'bagged_bsvm_model__estimator__kernel': ['linear', 'rbf'],
'bagged_bsvm_model__estimator__gamma': ['scale','auto'],
'bagged_bsvm_model__n_estimators': [100, 200]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
bagged_bsvm_grid_search = GridSearchCV(
estimator=bagged_bsvm_pipeline,
param_grid=bagged_bsvm_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
bagged_bsvm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])... BaggingClassifier(estimator=SVC(class_weight='balanced', random_state=987654321), random_state=987654321))]), n_jobs=-1, param_grid={'bagged_bsvm_model__estimator__C': [0.1, 1.0], 'bagged_bsvm_model__estimator__gamma': ['scale', 'auto'], 'bagged_bsvm_model__estimator__kernel': ['linear', 'rbf'], 'bagged_bsvm_model__n_estimators': [100, 200]}, scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('bagged_bsvm_model', BaggingClassifier(estimator=SVC(class_weight='balanced', kernel='linear', random_state=987654321), n_estimators=100, random_state=987654321))])
##################################
# Identifying the best model
##################################
bagged_bsvm_optimal = bagged_bsvm_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_bsvm_optimal_f1_cv = bagged_bsvm_grid_search.best_score_
bagged_bsvm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train))
bagged_bsvm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model – Bagged Support Vector Machine: ')
print(f"Best Bagged Support Vector Machine Hyperparameters: {bagged_bsvm_grid_search.best_params_}")
Best Bagged Model – Bagged Support Vector Machine:
Best Bagged Support Vector Machine Hyperparameters: {'bagged_bsvm_model__estimator__C': 1.0, 'bagged_bsvm_model__estimator__gamma': 'scale', 'bagged_bsvm_model__estimator__kernel': 'linear', 'bagged_bsvm_model__n_estimators': 100}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_bsvm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_bsvm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8209
F1 Score on Training Data: 0.8527

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0        0.96      0.91      0.93       143
         1.0        0.81      0.90      0.85        61

    accuracy                            0.91       204
   macro avg        0.88      0.91      0.89       204
weighted avg        0.91      0.91      0.91       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Support Vector Machine Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Support Vector Machine Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_bsvm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0        0.96      0.92      0.94        49
         1.0        0.82      0.90      0.86        20

    accuracy                            0.91        69
   macro avg        0.89      0.91      0.90        69
weighted avg        0.92      0.91      0.91        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Support Vector Machine Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Support Vector Machine Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
bagged_bsvm_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train))
bagged_bsvm_optimal_train['model'] = ['bagged_bsvm_optimal'] * 5
bagged_bsvm_optimal_train['set'] = ['train'] * 5
print('Optimal Bagged Support Vector Machine Train Performance Metrics: ')
display(bagged_bsvm_optimal_train)
Optimal Bagged Support Vector Machine Train Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.906863 | bagged_bsvm_optimal | train |
1 | Precision | 0.808824 | bagged_bsvm_optimal | train |
2 | Recall | 0.901639 | bagged_bsvm_optimal | train |
3 | F1 | 0.852713 | bagged_bsvm_optimal | train |
4 | AUROC | 0.905365 | bagged_bsvm_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
bagged_bsvm_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation))
bagged_bsvm_optimal_validation['model'] = ['bagged_bsvm_optimal'] * 5
bagged_bsvm_optimal_validation['set'] = ['validation'] * 5
print('Optimal Bagged Support Vector Machine Validation Performance Metrics: ')
display(bagged_bsvm_optimal_validation)
Optimal Bagged Support Vector Machine Validation Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.913043 | bagged_bsvm_optimal | validation |
1 | Precision | 0.818182 | bagged_bsvm_optimal | validation |
2 | Recall | 0.900000 | bagged_bsvm_optimal | validation |
3 | F1 | 0.857143 | bagged_bsvm_optimal | validation |
4 | AUROC | 0.909184 | bagged_bsvm_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(bagged_bsvm_optimal,
os.path.join("..", MODELS_PATH, "bagged_model_bagged_svm_optimal.pkl"))
['..\\models\\bagged_model_bagged_svm_optimal.pkl']
1.8. Boosted Model Development ¶
Boosting is an ensemble learning method that builds a strong classifier by training models sequentially, where each new model focuses on correcting the mistakes of its predecessors. Boosting assigns higher weights to misclassified instances, ensuring that subsequent models pay more attention to these hard-to-classify cases. The motivation behind boosting is to reduce both bias and variance by iteratively refining weak learners — models that perform only slightly better than random guessing — until they collectively form a strong classifier. In classification tasks, predictions are refined by combining weighted outputs of multiple weak models, typically decision stumps or shallow trees. This makes boosting highly effective in uncovering complex patterns in data. However, the sequential nature of boosting makes it computationally expensive compared to bagging, and it is more prone to overfitting if the number of weak learners is too high.
1.8.1 AdaBoost ¶
AdaBoost (Adaptive Boosting) is a boosting technique that combines multiple weak learners — typically decision stumps (shallow trees) — to form a strong classifier. It works by iteratively training weak models, assigning higher weights to misclassified instances so that subsequent models focus on difficult cases. At each iteration, a new weak model is trained, and its predictions are combined using a weighted voting mechanism. This process continues until a stopping criterion is met, such as a predefined number of iterations or performance threshold. AdaBoost is advantageous because it improves accuracy without overfitting if regularized properly. It performs well with clean data and can transform weak classifiers into strong ones. However, it is sensitive to noisy data and outliers, as misclassified points receive higher importance, leading to potential overfitting. Additionally, training can be slow for large datasets, and performance depends on the choice of base learner, typically decision trees.
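The reweighting mechanism described above can be traced with a small standalone sketch: each round fits a decision stump on weighted samples, computes the weighted error, derives the learner weight, and upweights the misclassified cases. The synthetic dataset and loop below are illustrative assumptions only; the tuned project implementation follows after the summary.
##################################
# Illustrative sketch only (synthetic data):
# AdaBoost-style sample reweighting with
# decision stumps as weak learners
##################################
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
y_signed = np.where(y_demo == 1, 1, -1)          # labels in {-1, +1}, the AdaBoost convention
weights = np.full(len(y_demo), 1 / len(y_demo))  # start with uniform sample weights

for m in range(5):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_signed, sample_weight=weights)
    pred = stump.predict(X_demo)
    err = np.sum(weights * (pred != y_signed)) / np.sum(weights)  # weighted error rate
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))             # weight of this weak learner
    weights *= np.exp(-alpha * y_signed * pred)                   # upweight misclassified cases
    weights /= weights.sum()                                      # renormalize to a distribution
    print(f"Round {m + 1}: weighted error = {err:.3f}, learner weight = {alpha:.3f}")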
- The AdaBoost model from the sklearn.ensemble Python library API was implemented.
- The model contains 3 hyperparameters for tuning:
- estimator_max_depth = maximum depth of the tree made to vary between 1 and 2
- learning_rate = weight applied to each classifier at each boosting iteration made to vary between 0.01 and 0.10
- n_estimators = maximum number of estimators at which boosting is terminated made to vary between 50 and 100
- No hyperparameter was defined in the model to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method, with the optimal model performance (F1 score) determined for:
- estimator_max_depth = 2
- learning_rate = 0.01
- n_estimators = 50
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9019
- Precision = 0.8059
- Recall = 0.8852
- F1 Score = 0.8437
- AUROC = 0.8971
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.9130
- Precision = 0.8181
- Recall = 0.9000
- F1 Score = 0.8571
- AUROC = 0.9091
- Sufficiently comparable apparent and independent validation model performance was observed, which might be indicative of the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_ab_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('boosted_ab_model', AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=987654321),
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
boosted_ab_hyperparameter_grid = {
'boosted_ab_model__learning_rate': [0.01, 0.10],
'boosted_ab_model__estimator__max_depth': [1, 2],
'boosted_ab_model__n_estimators': [50, 100]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
boosted_ab_grid_search = GridSearchCV(
estimator=boosted_ab_pipeline,
param_grid=boosted_ab_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
boosted_ab_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('boosted_ab_model', AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=987654321), random_state=987654321))]), n_jobs=-1, param_grid={'boosted_ab_model__estimator__max_depth': [1, 2], 'boosted_ab_model__learning_rate': [0.01, 0.1], 'boosted_ab_model__n_estimators': [50, 100]}, scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('boosted_ab_model', AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2, random_state=987654321), learning_rate=0.01, random_state=987654321))])
##################################
# Identifying the best model
##################################
boosted_ab_optimal = boosted_ab_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_ab_optimal_f1_cv = boosted_ab_grid_search.best_score_
boosted_ab_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train))
boosted_ab_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - AdaBoost: ')
print(f"Best AdaBoost Hyperparameters: {boosted_ab_grid_search.best_params_}")
Best Boosted Model - AdaBoost:
Best AdaBoost Hyperparameters: {'boosted_ab_model__estimator__max_depth': 2, 'boosted_ab_model__learning_rate': 0.01, 'boosted_ab_model__n_estimators': 50}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_ab_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_ab_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8364
F1 Score on Training Data: 0.8438

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0        0.95      0.91      0.93       143
         1.0        0.81      0.89      0.84        61

    accuracy                            0.90       204
   macro avg        0.88      0.90      0.89       204
weighted avg        0.91      0.90      0.90       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal AdaBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal AdaBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_ab_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0        0.96      0.92      0.94        49
         1.0        0.82      0.90      0.86        20

    accuracy                            0.91        69
   macro avg        0.89      0.91      0.90        69
weighted avg        0.92      0.91      0.91        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal AdaBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal AdaBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_ab_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train))
boosted_ab_optimal_train['model'] = ['boosted_ab_optimal'] * 5
boosted_ab_optimal_train['set'] = ['train'] * 5
print('Optimal AdaBoost Train Performance Metrics: ')
display(boosted_ab_optimal_train)
Optimal AdaBoost Train Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.901961 | boosted_ab_optimal | train |
1 | Precision | 0.805970 | boosted_ab_optimal | train |
2 | Recall | 0.885246 | boosted_ab_optimal | train |
3 | F1 | 0.843750 | boosted_ab_optimal | train |
4 | AUROC | 0.897168 | boosted_ab_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_ab_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation))
boosted_ab_optimal_validation['model'] = ['boosted_ab_optimal'] * 5
boosted_ab_optimal_validation['set'] = ['validation'] * 5
print('Optimal AdaBoost Validation Performance Metrics: ')
display(boosted_ab_optimal_validation)
Optimal AdaBoost Validation Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.913043 | boosted_ab_optimal | validation |
1 | Precision | 0.818182 | boosted_ab_optimal | validation |
2 | Recall | 0.900000 | boosted_ab_optimal | validation |
3 | F1 | 0.857143 | boosted_ab_optimal | validation |
4 | AUROC | 0.909184 | boosted_ab_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(boosted_ab_optimal,
os.path.join("..", MODELS_PATH, "boosted_model_adaboost_optimal.pkl"))
['..\\models\\boosted_model_adaboost_optimal.pkl']
1.8.2 Gradient Boosting ¶
Gradient Boosting builds an ensemble of decision trees sequentially, where each new tree corrects the mistakes of the previous ones by optimizing a loss function. Unlike AdaBoost, which reweights misclassified instances, Gradient Boosting fits each new tree to the residual errors of the previous model, gradually improving predictions. This process continues until a stopping criterion, such as a set number of trees, is met. The key advantages of Gradient Boosting include its flexibility to model complex relationships and strong predictive performance, often outperforming bagging methods. It can handle both numeric and categorical data well. However, it is prone to overfitting if not carefully tuned, especially with deep trees and too many iterations. It is also computationally expensive due to sequential training, and hyperparameter tuning (e.g., learning rate, number of trees, tree depth) can be challenging and time-consuming.
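The residual-fitting idea is easiest to see in a simplified regression setting: starting from a constant prediction, each new tree is fit to the errors left by the current ensemble and its shrunken output is added to the running prediction. The sketch below uses synthetic regression data and illustrative variable names; the classification model tuned in this section follows the same principle but fits trees to the gradients of a classification loss.
##################################
# Illustrative sketch only (synthetic data):
# gradient-boosting intuition via sequential
# fitting of trees to residual errors
##################################
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, 200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant prediction
trees = []

for m in range(50):
    residuals = y_demo - prediction               # errors left by the current ensemble
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)                   # each new tree models the residuals
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

print(f"Training MSE after 50 trees: {np.mean((y_demo - prediction) ** 2):.4f}")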
- The gradient boosting model from the sklearn.ensemble Python library API was implemented.
- The model contains 4 hyperparameters for tuning:
- learning_rate = shrinkage factor applied to the contribution of each tree made to vary between 0.01 and 0.10
- max_depth = maximum depth of the tree made to vary between 3 and 6
- min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
- n_estimators = number of boosting stages to perform made to vary between 50 and 100
- No hyperparameter was defined in the model to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method, with the optimal model performance (F1 score) determined for:
- learning_rate = 0.10
- max_depth = 3
- min_samples_leaf = 10
- n_estimators = 50
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9460
- Precision = 0.9032
- Recall = 0.9180
- F1 Score = 0.9105
- AUROC = 0.9380
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8985
- Precision = 0.8095
- Recall = 0.8500
- F1 Score = 0.8292
- AUROC = 0.8841
- A relatively large difference between the apparent and independent validation model performance was observed, which might be indicative of moderate model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_gb_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('boosted_gb_model', GradientBoostingClassifier(random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
boosted_gb_hyperparameter_grid = {
'boosted_gb_model__learning_rate': [0.01, 0.10],
'boosted_gb_model__max_depth': [3, 6],
'boosted_gb_model__min_samples_leaf': [5, 10],
'boosted_gb_model__n_estimators': [50, 100]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
boosted_gb_grid_search = GridSearchCV(
estimator=boosted_gb_pipeline,
param_grid=boosted_gb_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
boosted_gb_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('boosted_gb_model', GradientBoostingClassifier(random_state=987654321))]), n_jobs=-1, param_grid={'boosted_gb_model__learning_rate': [0.01, 0.1], 'boosted_gb_model__max_depth': [3, 6], 'boosted_gb_model__min_samples_leaf': [5, 10], 'boosted_gb_model__n_estimators': [50, 100]}, scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('boosted_gb_model', GradientBoostingClassifier(min_samples_leaf=10, n_estimators=50, random_state=987654321))])
##################################
# Identifying the best model
##################################
boosted_gb_optimal = boosted_gb_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_gb_optimal_f1_cv = boosted_gb_grid_search.best_score_
boosted_gb_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train))
boosted_gb_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - Gradient Boosting: ')
print(f"Best Gradient Boosting Hyperparameters: {boosted_gb_grid_search.best_params_}")
Best Boosted Model - Gradient Boosting:
Best Gradient Boosting Hyperparameters: {'boosted_gb_model__learning_rate': 0.1, 'boosted_gb_model__max_depth': 3, 'boosted_gb_model__min_samples_leaf': 10, 'boosted_gb_model__n_estimators': 50}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_gb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_gb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8131
F1 Score on Training Data: 0.9106

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0        0.96      0.96      0.96       143
         1.0        0.90      0.92      0.91        61

    accuracy                            0.95       204
   macro avg        0.93      0.94      0.94       204
weighted avg        0.95      0.95      0.95       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Gradient Boosting Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Gradient Boosting Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_gb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8293

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0        0.94      0.92      0.93        49
         1.0        0.81      0.85      0.83        20

    accuracy                            0.90        69
   macro avg        0.87      0.88      0.88        69
weighted avg        0.90      0.90      0.90        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Gradient Boosting Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Gradient Boosting Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_gb_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train))
boosted_gb_optimal_train['model'] = ['boosted_gb_optimal'] * 5
boosted_gb_optimal_train['set'] = ['train'] * 5
print('Optimal Gradient Boosting Train Performance Metrics: ')
display(boosted_gb_optimal_train)
Optimal Gradient Boosting Train Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.946078 | boosted_gb_optimal | train |
1 | Precision | 0.903226 | boosted_gb_optimal | train |
2 | Recall | 0.918033 | boosted_gb_optimal | train |
3 | F1 | 0.910569 | boosted_gb_optimal | train |
4 | AUROC | 0.938037 | boosted_gb_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_gb_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation))
boosted_gb_optimal_validation['model'] = ['boosted_gb_optimal'] * 5
boosted_gb_optimal_validation['set'] = ['validation'] * 5
print('Optimal Gradient Boosting Validation Performance Metrics: ')
display(boosted_gb_optimal_validation)
Optimal Gradient Boosting Validation Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.898551 | boosted_gb_optimal | validation |
1 | Precision | 0.809524 | boosted_gb_optimal | validation |
2 | Recall | 0.850000 | boosted_gb_optimal | validation |
3 | F1 | 0.829268 | boosted_gb_optimal | validation |
4 | AUROC | 0.884184 | boosted_gb_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(boosted_gb_optimal,
os.path.join("..", MODELS_PATH, "boosted_model_gradient_boosting_optimal.pkl"))
['..\\models\\boosted_model_gradient_boosting_optimal.pkl']
1.8.3 XGBoost ¶
XGBoost (Extreme Gradient Boosting) is an optimized version of Gradient Boosting that introduces additional regularization and computational efficiencies. It builds decision trees sequentially, with each new tree correcting the residual errors of the previous ones, but it incorporates advanced techniques such as shrinkage (learning rate), column subsampling, and L1/L2 regularization to prevent overfitting. Additionally, XGBoost employs parallelization, reducing training time significantly compared to standard Gradient Boosting. It is widely used in machine learning competitions due to its superior accuracy and efficiency. The key advantages include its ability to handle missing data, built-in regularization for better generalization, and fast training through parallelization. However, XGBoost requires careful hyperparameter tuning to achieve optimal performance, and the model can become overly complex, making interpretation difficult. It is also memory-intensive, especially for large datasets, and can be challenging to deploy efficiently in real-time applications.
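To show where the regularization and subsampling controls mentioned above live in the API, the sketch below configures an XGBClassifier with shrinkage, row and column subsampling, and L1/L2 penalties on synthetic data; the parameter values and data are illustrative assumptions and differ from the tuned settings used in this section.
##################################
# Illustrative sketch only (synthetic data):
# XGBoost with shrinkage, subsampling, and
# L1/L2 regularization
##################################
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

xgb_demo = XGBClassifier(n_estimators=100,
                         learning_rate=0.1,     # shrinkage applied to each tree's contribution
                         max_depth=3,
                         subsample=0.7,         # row subsampling per tree
                         colsample_bytree=0.7,  # column subsampling per tree
                         reg_alpha=0.1,         # L1 regularization on leaf weights
                         reg_lambda=1.0,        # L2 regularization on leaf weights
                         eval_metric='logloss',
                         random_state=0)
xgb_demo.fit(X_demo, y_demo)
print(xgb_demo.predict(X_demo[:5]))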
- The XGBoost model from the xgboost Python library API was implemented.
- The model contains 4 hyperparameters for tuning:
- learning_rate = step size shrinkage applied at each boosting iteration made to vary between 0.01 and 0.10
- max_depth = maximum depth of the tree made to vary between 3 and 6
- gamma = minimum loss reduction required to make a further split in a tree made to vary between 0.10 and 0.20
- n_estimators = number of boosting stages to perform made to vary between 50 and 100
- A special hyperparameter (scale_pos_weight = 2.0) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method, with the optimal model performance (F1 score) determined for:
- learning_rate = 0.01
- max_depth = 3
- gamma = 0.10
- n_estimators = 50
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9068
- Precision = 0.8181
- Recall = 0.8852
- F1 Score = 0.8503
- AUROC = 0.9006
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.9130
- Precision = 0.8181
- Recall = 0.9000
- F1 Score = 0.8571
- AUROC = 0.9091
- Sufficiently comparable apparent and independent validation model performance was observed, which might be indicative of the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_xgb_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('boosted_xgb_model', XGBClassifier(scale_pos_weight=2.0,
random_state=987654321,
subsample=0.7,
colsample_bytree=0.7,
eval_metric='logloss'))
])
##################################
# Defining hyperparameter grid
##################################
boosted_xgb_hyperparameter_grid = {
'boosted_xgb_model__learning_rate': [0.01, 0.10],
'boosted_xgb_model__max_depth': [3, 6],
'boosted_xgb_model__gamma': [0.1, 0.2],
'boosted_xgb_model__n_estimators': [50, 100]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
boosted_xgb_grid_search = GridSearchCV(
estimator=boosted_xgb_pipeline,
param_grid=boosted_xgb_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
boosted_xgb_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])... missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=None, n_jobs=None, num_parallel_tree=None, random_state=987654321, ...))]), n_jobs=-1, param_grid={'boosted_xgb_model__gamma': [0.1, 0.2], 'boosted_xgb_model__learning_rate': [0.01, 0.1], 'boosted_xgb_model__max_depth': [3, 6], 'boosted_xgb_model__n_estimators': [50, 100]}, scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('boosted_xgb_model', XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_byle... feature_types=None, gamma=0.1, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.01, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=3, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=50, n_jobs=None, num_parallel_tree=None, random_state=987654321, ...))])
##################################
# Identifying the best model
##################################
boosted_xgb_optimal = boosted_xgb_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_xgb_optimal_f1_cv = boosted_xgb_grid_search.best_score_
boosted_xgb_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train))
boosted_xgb_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - XGBoost: ')
print(f"Best XGBoost Hyperparameters: {boosted_xgb_grid_search.best_params_}")
Best Boosted Model - XGBoost:
Best XGBoost Hyperparameters: {'boosted_xgb_model__gamma': 0.1, 'boosted_xgb_model__learning_rate': 0.01, 'boosted_xgb_model__max_depth': 3, 'boosted_xgb_model__n_estimators': 50}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_xgb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_xgb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8322
F1 Score on Training Data: 0.8504

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.92      0.93       143
         1.0       0.82      0.89      0.85        61

    accuracy                           0.91       204
   macro avg       0.88      0.90      0.89       204
weighted avg       0.91      0.91      0.91       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal XGBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal XGBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_xgb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal XGBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal XGBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_xgb_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train))
boosted_xgb_optimal_train['model'] = ['boosted_xgb_optimal'] * 5
boosted_xgb_optimal_train['set'] = ['train'] * 5
print('Optimal XGBoost Train Performance Metrics: ')
display(boosted_xgb_optimal_train)
Optimal XGBoost Train Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.906863 | boosted_xgb_optimal | train |
1 | Precision | 0.818182 | boosted_xgb_optimal | train |
2 | Recall | 0.885246 | boosted_xgb_optimal | train |
3 | F1 | 0.850394 | boosted_xgb_optimal | train |
4 | AUROC | 0.900665 | boosted_xgb_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_xgb_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation))
boosted_xgb_optimal_validation['model'] = ['boosted_xgb_optimal'] * 5
boosted_xgb_optimal_validation['set'] = ['validation'] * 5
print('Optimal XGBoost Validation Performance Metrics: ')
display(boosted_xgb_optimal_validation)
Optimal XGBoost Validation Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.913043 | boosted_xgb_optimal | validation |
1 | Precision | 0.818182 | boosted_xgb_optimal | validation |
2 | Recall | 0.900000 | boosted_xgb_optimal | validation |
3 | F1 | 0.857143 | boosted_xgb_optimal | validation |
4 | AUROC | 0.909184 | boosted_xgb_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(boosted_xgb_optimal,
os.path.join("..", MODELS_PATH, "boosted_model_xgboost_optimal.pkl"))
['..\\models\\boosted_model_xgboost_optimal.pkl']
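For completeness, a minimal sketch of how a pipeline persisted in this way could be reloaded later for inference; the reload path simply mirrors the saving convention used above, and the validation frame is reused purely for illustration.
##################################
# Reloading a saved pipeline for inference
# (illustrative sketch mirroring the
# saving convention used above)
##################################
import os
import joblib

reloaded_boosted_xgb = joblib.load(
    os.path.join("..", MODELS_PATH, "boosted_model_xgboost_optimal.pkl"))
##################################
# The reloaded object is the complete
# preprocessing and modeling pipeline,
# so raw predictor frames can be passed directly
##################################
reloaded_predictions = reloaded_boosted_xgb.predict(X_preprocessed_validation)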
1.8.4 Light GBM ¶
Light GBM (Light Gradient Boosting Machine) is a variation of Gradient Boosting designed for efficiency and scalability. Unlike traditional boosting methods that grow trees level by level, LightGBM grows trees leaf-wise, choosing the most informative splits, leading to faster convergence. It also uses histogram-based binning to speed up computations. These optimizations allow LightGBM to train on large datasets efficiently while maintaining high accuracy. Its advantages include faster training speed, reduced memory usage, and strong predictive performance, particularly for large datasets with many features. However, LightGBM can overfit more easily than XGBoost if not properly tuned, and it may not perform as well on small datasets. Additionally, its handling of categorical variables requires careful preprocessing, and the leaf-wise tree growth can sometimes lead to instability if not controlled properly.
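As a minimal, illustrative mapping of these concepts onto the LGBMClassifier API (the parameter values below are placeholders rather than the tuned settings developed in this section), leaf-wise growth is capped by num_leaves, histogram binning granularity is set by max_bin, and max_depth=-1 leaves tree depth unconstrained.
##################################
# Illustrative LGBMClassifier configuration
# mapping leaf-wise growth and histogram
# binning to API parameters
# (placeholder values only)
##################################
from lightgbm import LGBMClassifier

lgbm_concept_sketch = LGBMClassifier(
    num_leaves=16,        # caps the number of leaves grown leaf-wise per tree
    max_depth=-1,         # -1 leaves depth unbounded; num_leaves does the limiting
    max_bin=255,          # number of histogram bins used to discretize feature values
    min_child_samples=6,  # minimum number of samples required in a leaf
    learning_rate=0.05,   # shrinkage applied to each boosting iteration
    n_estimators=100,     # number of boosted trees to fit
    random_state=987654321)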
- The light gbm model from the lightgbm Python library API was implemented.
- The model contains 4 hyperparameters for tuning:
- learning_rate = step size at which weights are updated during training made to vary between 0.01 and 0.10
- min_child_samples = minimum number of data needed in a child (leaf) made to vary between 3 and 6
- num_leaves = maximum tree leaves for base learners made to vary between 8 and 16
- n_estimators = number of boosted trees to fit made to vary between 50 and 100
- A special hyperparameter (scale_pos_weight = 2.0) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories (a short sketch after this list shows how such a weight can be derived from the training class counts).
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- learning_rate = 0.01
- min_child_samples = 6
- num_leaves = 16
- n_estimators = 100
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9362
- Precision = 0.8870
- Recall = 0.9016
- F1 Score = 0.8943
- AUROC = 0.9263
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8985
- Precision = 0.8421
- Recall = 0.8000
- F1 Score = 0.8205
- AUROC = 0.8693
- Relatively large difference in apparent and independent validation model performance observed that might be indicative of the presence of moderate model overfitting.
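As referenced in the hyperparameter notes above, a minimal sketch of how a class-imbalance weight such as scale_pos_weight can be derived from the encoded training labels; with the 143 No and 61 Yes Recurred cases reported in the classification reports, the ratio is roughly 2.3, consistent with the fixed value of 2.0.
##################################
# Deriving a class-imbalance weight
# from the encoded training labels
# (illustrative sketch)
##################################
import numpy as np

negative_count, positive_count = np.bincount(y_preprocessed_train_encoded.astype(int))
derived_scale_pos_weight = negative_count / positive_count
##################################
# With 143 negative and 61 positive cases
# this evaluates to approximately 2.34,
# consistent with the fixed scale_pos_weight = 2.0
##################################
print(f"Derived scale_pos_weight: {derived_scale_pos_weight:.2f}")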
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_lgbm_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('boosted_lgbm_model', LGBMClassifier(scale_pos_weight=2.0,
random_state=987654321,
max_depth=-1,
feature_fraction =0.7,
bagging_fraction=0.7,
verbose=-1))
])
##################################
# Defining hyperparameter grid
##################################
boosted_lgbm_hyperparameter_grid = {
'boosted_lgbm_model__learning_rate': [0.01, 0.10],
'boosted_lgbm_model__min_child_samples': [3, 6],
'boosted_lgbm_model__num_leaves': [8, 16],
'boosted_lgbm_model__n_estimators': [50, 100]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
boosted_lgbm_grid_search = GridSearchCV(
estimator=boosted_lgbm_pipeline,
param_grid=boosted_lgbm_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
boosted_lgbm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])... ('boosted_lgbm_model', LGBMClassifier(bagging_fraction=0.7, feature_fraction=0.7, random_state=987654321, scale_pos_weight=2.0, verbose=-1))]), n_jobs=-1, param_grid={'boosted_lgbm_model__learning_rate': [0.01, 0.1], 'boosted_lgbm_model__min_child_samples': [3, 6], 'boosted_lgbm_model__n_estimators': [50, 100], 'boosted_lgbm_model__num_leaves': [8, 16]}, scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('boosted_lgbm_model', LGBMClassifier(bagging_fraction=0.7, feature_fraction=0.7, learning_rate=0.01, min_child_samples=6, num_leaves=16, random_state=987654321, scale_pos_weight=2.0, verbose=-1))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
LGBMClassifier(bagging_fraction=0.7, feature_fraction=0.7, learning_rate=0.01, min_child_samples=6, num_leaves=16, random_state=987654321, scale_pos_weight=2.0, verbose=-1)
##################################
# Identifying the best model
##################################
boosted_lgbm_optimal = boosted_lgbm_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
##################################
# Suppressing scikit-learn user warnings
# triggered during repeated predictions
##################################
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='sklearn.utils.validation')
boosted_lgbm_optimal_f1_cv = boosted_lgbm_grid_search.best_score_
boosted_lgbm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train))
boosted_lgbm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - Light GBM: ')
print(f"Best Light GBM Hyperparameters: {boosted_lgbm_grid_search.best_params_}")
Best Boosted Model - Light GBM:
Best Light GBM Hyperparameters: {'boosted_lgbm_model__learning_rate': 0.01, 'boosted_lgbm_model__min_child_samples': 6, 'boosted_lgbm_model__n_estimators': 100, 'boosted_lgbm_model__num_leaves': 16}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_lgbm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_lgbm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8182
F1 Score on Training Data: 0.8943

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.95      0.95       143
         1.0       0.89      0.90      0.89        61

    accuracy                           0.94       204
   macro avg       0.92      0.93      0.92       204
weighted avg       0.94      0.94      0.94       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Light GBM Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Light GBM Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_lgbm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8205

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.92      0.94      0.93        49
         1.0       0.84      0.80      0.82        20

    accuracy                           0.90        69
   macro avg       0.88      0.87      0.87        69
weighted avg       0.90      0.90      0.90        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Light GBM Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Light GBM Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_lgbm_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train))
boosted_lgbm_optimal_train['model'] = ['boosted_lgbm_optimal'] * 5
boosted_lgbm_optimal_train['set'] = ['train'] * 5
print('Optimal Light GBM Train Performance Metrics: ')
display(boosted_lgbm_optimal_train)
Optimal Light GBM Train Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.936275 | boosted_lgbm_optimal | train |
1 | Precision | 0.887097 | boosted_lgbm_optimal | train |
2 | Recall | 0.901639 | boosted_lgbm_optimal | train |
3 | F1 | 0.894309 | boosted_lgbm_optimal | train |
4 | AUROC | 0.926344 | boosted_lgbm_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_lgbm_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation))
boosted_lgbm_optimal_validation['model'] = ['boosted_lgbm_optimal'] * 5
boosted_lgbm_optimal_validation['set'] = ['validation'] * 5
print('Optimal Light GBM Validation Performance Metrics: ')
display(boosted_lgbm_optimal_validation)
Optimal Light GBM Validation Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.898551 | boosted_lgbm_optimal | validation |
1 | Precision | 0.842105 | boosted_lgbm_optimal | validation |
2 | Recall | 0.800000 | boosted_lgbm_optimal | validation |
3 | F1 | 0.820513 | boosted_lgbm_optimal | validation |
4 | AUROC | 0.869388 | boosted_lgbm_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(boosted_lgbm_optimal,
os.path.join("..", MODELS_PATH, "boosted_model_light_gbm_optimal.pkl"))
['..\\models\\boosted_model_light_gbm_optimal.pkl']
1.8.5 CatBoost ¶
CatBoost (Categorical Boosting) is a boosting algorithm optimized for categorical data. Unlike other gradient boosting methods that require categorical variables to be manually encoded, CatBoost handles them natively, reducing preprocessing effort and improving performance. It builds decision trees iteratively, like other boosting methods, but uses ordered boosting to prevent target leakage and enhance generalization. The main advantages of CatBoost are its ability to handle categorical data without extensive preprocessing, high accuracy with minimal tuning, and robustness against overfitting due to built-in regularization. Additionally, it is relatively fast and memory-efficient. However, CatBoost can still be slower than LightGBM on very large datasets, and while it requires less tuning, improper parameter selection can lead to suboptimal performance. Its internal mechanics, such as ordered boosting, make interpretation more complex compared to simpler models.
- The catboost model from the catboost Python library API was implemented.
- The model contains 4 hyperparameters for tuning:
- learning_rate = step size at which weights are updated during training made to vary between 0.01 and 0.10
- max_depth = maximum depth of each decision tree in the boosting process made to vary between 3 and 6
- num_leaves = maximum tree leaves for base learners made to vary between 8 and 16
- iterations = number of boosted trees to fit made to vary between 50 and 100
- A special hyperparameter (scale_pos_weight = 2.0) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- learning_rate = 0.01
- max_depth = 3
- num_leaves = 8
- iterations = 50
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9019
- Precision = 0.8059
- Recall = 0.8852
- F1 Score = 0.8437
- AUROC = 0.8971
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.9130
- Precision = 0.8181
- Recall = 0.9000
- F1 Score = 0.8571
- AUROC = 0.9091
- Sufficiently comparable apparent and independent validation model performance observed that might be indicative of the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_cb_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('boosted_cb_model', LGBMClassifier(scale_pos_weight=2.0,
random_state=987654321,
subsample =0.7,
colsample_bylevel=0.7))
])
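Since this section targets CatBoost while the pipeline above wraps LGBMClassifier with CatBoost-style parameter names, the following is a minimal sketch of an equivalent pipeline built directly on CatBoostClassifier from the catboost library (assuming it is installed); it reuses the categorical_preprocessor defined above, although CatBoost could alternatively consume the raw categorical columns through its cat_features argument.
##################################
# Illustrative CatBoostClassifier pipeline
# equivalent to the LGBMClassifier-based
# pipeline above (sketch only)
##################################
from catboost import CatBoostClassifier
from sklearn.pipeline import Pipeline

boosted_cb_pipeline_sketch = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('boosted_cb_model', CatBoostClassifier(scale_pos_weight=2.0,
                                            random_seed=987654321,
                                            subsample=0.7,
                                            colsample_bylevel=0.7,
                                            verbose=0))
])
##################################
# A matching grid would tune
# boosted_cb_model__iterations,
# boosted_cb_model__learning_rate,
# and boosted_cb_model__depth
##################################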
##################################
# Defining hyperparameter grid
##################################
boosted_cb_hyperparameter_grid = {
'boosted_cb_model__learning_rate': [0.01, 0.10],
'boosted_cb_model__max_depth': [3, 6],
'boosted_cb_model__num_leaves': [8, 16],
'boosted_cb_model__iterations': [50, 100]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
boosted_cb_grid_search = GridSearchCV(
estimator=boosted_cb_pipeline,
param_grid=boosted_cb_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
boosted_cb_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('boosted_cb_model', LGBMClassifier(colsample_bylevel=0.7, random_state=987654321, scale_pos_weight=2.0, subsample=0.7))]), n_jobs=-1, param_grid={'boosted_cb_model__iterations': [50, 100], 'boosted_cb_model__learning_rate': [0.01, 0.1], 'boosted_cb_model__max_depth': [3, 6], 'boosted_cb_model__num_leaves': [8, 16]}, scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('boosted_cb_model', LGBMClassifier(colsample_bylevel=0.7, iterations=50, learning_rate=0.01, max_depth=3, num_leaves=8, random_state=987654321, scale_pos_weight=2.0, subsample=0.7))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
LGBMClassifier(colsample_bylevel=0.7, iterations=50, learning_rate=0.01, max_depth=3, num_leaves=8, random_state=987654321, scale_pos_weight=2.0, subsample=0.7)
##################################
# Identifying the best model
##################################
boosted_cb_optimal = boosted_cb_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_cb_optimal_f1_cv = boosted_cb_grid_search.best_score_
boosted_cb_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train))
boosted_cb_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - CatBoost: ')
print(f"Best CatBoost Hyperparameters: {boosted_cb_grid_search.best_params_}")
Best Boosted Model - CatBoost:
Best CatBoost Hyperparameters: {'boosted_cb_model__iterations': 50, 'boosted_cb_model__learning_rate': 0.01, 'boosted_cb_model__max_depth': 3, 'boosted_cb_model__num_leaves': 8}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_cb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_cb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8277
F1 Score on Training Data: 0.8438

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.91      0.93       143
         1.0       0.81      0.89      0.84        61

    accuracy                           0.90       204
   macro avg       0.88      0.90      0.89       204
weighted avg       0.91      0.90      0.90       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal CatBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal CatBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_cb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal CatBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal CatBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_cb_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train))
boosted_cb_optimal_train['model'] = ['boosted_cb_optimal'] * 5
boosted_cb_optimal_train['set'] = ['train'] * 5
print('Optimal CatBoost Train Performance Metrics: ')
display(boosted_cb_optimal_train)
Optimal CatBoost Train Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.901961 | boosted_cb_optimal | train |
1 | Precision | 0.805970 | boosted_cb_optimal | train |
2 | Recall | 0.885246 | boosted_cb_optimal | train |
3 | F1 | 0.843750 | boosted_cb_optimal | train |
4 | AUROC | 0.897168 | boosted_cb_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_cb_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation))
boosted_cb_optimal_validation['model'] = ['boosted_cb_optimal'] * 5
boosted_cb_optimal_validation['set'] = ['validation'] * 5
print('Optimal CatBoost Validation Performance Metrics: ')
display(boosted_cb_optimal_validation)
Optimal CatBoost Validation Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.913043 | boosted_cb_optimal | validation |
1 | Precision | 0.818182 | boosted_cb_optimal | validation |
2 | Recall | 0.900000 | boosted_cb_optimal | validation |
3 | F1 | 0.857143 | boosted_cb_optimal | validation |
4 | AUROC | 0.909184 | boosted_cb_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(boosted_cb_optimal,
os.path.join("..", MODELS_PATH, "boosted_model_catboost_optimal.pkl"))
['..\\models\\boosted_model_catboost_optimal.pkl']
1.9. Stacked Model Development ¶
Stacking, or stacked generalization, is an advanced ensemble method that improves predictive performance by training a meta-model to learn the optimal way to combine multiple base models using their out-of-fold predictions. Unlike traditional ensemble techniques such as bagging and boosting, which aggregate predictions through simple rules like averaging or majority voting, stacking introduces a second-level model that intelligently learns how to integrate diverse base models. The process starts by training multiple classifiers on the training dataset. However, instead of directly using their predictions, stacking employs k-fold cross-validation to generate out-of-fold predictions. Specifically, each base model is trained on a subset of the training data while leaving out a validation fold, and predictions on that unseen fold are recorded. This process is repeated across all folds, ensuring that each instance in the training data receives predictions from models that never saw it during training. These out-of-fold predictions are then used as input features for a meta-model, which learns the best way to combine them into a final decision. The advantage of stacking is that it allows different models to complement each other, capturing diverse aspects of the data that a single model might miss. This often results in superior classification accuracy compared to individual models or simpler ensemble approaches. However, stacking is computationally expensive, requiring multiple training iterations for base models and the additional meta-model. It also demands careful tuning to prevent overfitting, as the meta-model’s complexity can introduce new sources of error. Despite these challenges, stacking remains a powerful technique in applications where maximizing predictive performance is a priority.
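A minimal sketch of this mechanism using scikit-learn's StackingClassifier is shown below; the base learners and settings are illustrative placeholders rather than the tuned pipelines developed in the following subsections, and the cv argument controls how the out-of-fold predictions that feed the Logistic Regression meta-model are generated.
##################################
# Illustrative stacking ensemble:
# out-of-fold base learner predictions
# feeding a logistic regression meta-model
##################################
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

stacking_sketch = StackingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier(n_neighbors=3)),
        ('svm', SVC(kernel='linear', probability=True, random_state=987654321)),
        ('tree', DecisionTreeClassifier(max_depth=3, random_state=987654321))
    ],
    final_estimator=LogisticRegression(max_iter=1000, random_state=987654321),
    cv=5,                  # 5-fold CV generates the out-of-fold meta-features
    stack_method='auto')
##################################
# Fitting would train each base learner,
# build out-of-fold predictions, and fit
# the meta-model on those predictions:
# stacking_sketch.fit(X_preprocessed_train, y_preprocessed_train_encoded)
##################################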
1.9.1 Base Learner - K-Nearest Neighbors ¶
K-Nearest Neighbors (KNN) is a non-parametric classification algorithm that makes predictions based on the majority class among the k-nearest training samples in feature space. It does not create an explicit model during training; instead, it stores the entire dataset and computes distances between a query point and all training samples during inference. The algorithm follows three key steps: (1) compute the distance between the query point and all training samples (typically using Euclidean distance), (2) identify the k closest points, and (3) assign the most common class among them as the predicted label. KNN is advantageous because it is simple, requires minimal training time, and can model complex decision boundaries when provided with sufficient data. However, it has significant drawbacks: it is computationally expensive for large datasets since distances must be computed for every prediction, it is sensitive to irrelevant or redundant features, and it requires careful selection of k, as a small k can make the model too sensitive to noise while a large k can overly smooth decision boundaries.
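A minimal NumPy sketch of the three prediction steps described above (distance computation, neighbor selection, and majority voting); this is purely illustrative and separate from the scikit-learn implementation tuned below.
##################################
# Illustrative k-nearest neighbors prediction:
# distance computation, neighbor selection,
# and majority voting (sketch only)
##################################
import numpy as np

def knn_predict_single(query_point, X_train, y_train, k=3):
    # Step 1: Euclidean distance from the query point to every training sample
    distances = np.sqrt(((X_train - query_point) ** 2).sum(axis=1))
    # Step 2: indices of the k closest training samples
    nearest_indices = np.argsort(distances)[:k]
    # Step 3: majority vote among the k neighbor labels
    neighbor_labels = y_train[nearest_indices].astype(int)
    return np.bincount(neighbor_labels).argmax()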
- The k-nearest neighbors model from the sklearn.neighbors Python library API was implemented.
- The model contains 3 hyperparameters for tuning:
- n_neighbors = number of neighbors to use made to vary between 3 and 5
- weights = weight function used in prediction made to vary between uniform and distance
- metric = metric to use for distance computation made to vary between minkowski and euclidean (with the default p = 2, minkowski is equivalent to euclidean)
- No special hyperparameter was defined in the model to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- n_neighbors = 3
- weights = uniform
- metric = minkowski
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9215
- Precision = 0.9090
- Recall = 0.8196
- F1 Score = 0.8620
- AUROC = 0.8923
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8115
- Precision = 0.7058
- Recall = 0.6000
- F1 Score = 0.6486
- AUROC = 0.7489
- Relatively large difference in apparent and independent validation model performance observed that might be indicative of the presence of moderate model overfitting (a gap calculation is sketched after this list).
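A minimal sketch of the gap calculation referenced above, quantifying overfitting optimism as the drop from apparent to independent validation F1; applied to the KNN scores reported in this subsection (0.8621 versus 0.6486), the drop is about 0.21, or roughly a quarter of the apparent F1.
##################################
# Quantifying overfitting optimism as the
# gap between apparent (train) and
# independent validation F1 scores
# (illustrative sketch)
##################################
def f1_optimism_gap(f1_train, f1_validation):
    # Absolute drop and the drop relative to the apparent score
    absolute_gap = f1_train - f1_validation
    relative_gap = absolute_gap / f1_train
    return absolute_gap, relative_gap

##################################
# Example with the KNN base learner scores
##################################
print(f1_optimism_gap(0.8621, 0.6486))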
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_knn_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('stacked_baselearner_knn_model', KNeighborsClassifier())
])
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_knn_hyperparameter_grid = {
'stacked_baselearner_knn_model__n_neighbors': [3, 5],
'stacked_baselearner_knn_model__weights': ['uniform', 'distance'],
'stacked_baselearner_knn_model__metric': ['minkowski', 'euclidean']
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_knn_grid_search = GridSearchCV(
estimator=stacked_baselearner_knn_pipeline,
param_grid=stacked_baselearner_knn_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_knn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('stacked_baselearner_knn_model', KNeighborsClassifier())]), n_jobs=-1, param_grid={'stacked_baselearner_knn_model__metric': ['minkowski', 'euclidean'], 'stacked_baselearner_knn_model__n_neighbors': [3, 5], 'stacked_baselearner_knn_model__weights': ['uniform', 'distance']}, scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('stacked_baselearner_knn_model', KNeighborsClassifier(n_neighbors=3))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
KNeighborsClassifier(n_neighbors=3)
##################################
# Identifying the best model
##################################
stacked_baselearner_knn_optimal = stacked_baselearner_knn_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_knn_optimal_f1_cv = stacked_baselearner_knn_grid_search.best_score_
stacked_baselearner_knn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train))
stacked_baselearner_knn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Stacked Base Learner KNN: ')
print(f"Best Stacked Base Learner KNN Hyperparameters: {stacked_baselearner_knn_grid_search.best_params_}")
Best Stacked Base Learner KNN:
Best Stacked Base Learner KNN Hyperparameters: {'stacked_baselearner_knn_model__metric': 'minkowski', 'stacked_baselearner_knn_model__n_neighbors': 3, 'stacked_baselearner_knn_model__weights': 'uniform'}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_knn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_knn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.6417
F1 Score on Training Data: 0.8621

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.93      0.97      0.95       143
         1.0       0.91      0.82      0.86        61

    accuracy                           0.92       204
   macro avg       0.92      0.89      0.90       204
weighted avg       0.92      0.92      0.92       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner KNN Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner KNN Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_knn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.6486

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.85      0.90      0.87        49
         1.0       0.71      0.60      0.65        20

    accuracy                           0.81        69
   macro avg       0.78      0.75      0.76        69
weighted avg       0.81      0.81      0.81        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner KNN Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner KNN Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_baselearner_knn_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train))
stacked_baselearner_knn_optimal_train['model'] = ['stacked_baselearner_knn_optimal'] * 5
stacked_baselearner_knn_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Base Learner KNN Train Performance Metrics: ')
display(stacked_baselearner_knn_optimal_train)
Optimal Stacked Base Learner KNN Train Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.921569 | stacked_baselearner_knn_optimal | train |
1 | Precision | 0.909091 | stacked_baselearner_knn_optimal | train |
2 | Recall | 0.819672 | stacked_baselearner_knn_optimal | train |
3 | F1 | 0.862069 | stacked_baselearner_knn_optimal | train |
4 | AUROC | 0.892354 | stacked_baselearner_knn_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_baselearner_knn_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation))
stacked_baselearner_knn_optimal_validation['model'] = ['stacked_baselearner_knn_optimal'] * 5
stacked_baselearner_knn_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Base Learner KNN Validation Performance Metrics: ')
display(stacked_baselearner_knn_optimal_validation)
Optimal Stacked Base Learner KNN Validation Performance Metrics:
 | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.811594 | stacked_baselearner_knn_optimal | validation |
1 | Precision | 0.705882 | stacked_baselearner_knn_optimal | validation |
2 | Recall | 0.600000 | stacked_baselearner_knn_optimal | validation |
3 | F1 | 0.648649 | stacked_baselearner_knn_optimal | validation |
4 | AUROC | 0.748980 | stacked_baselearner_knn_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(stacked_baselearner_knn_optimal,
os.path.join("..", MODELS_PATH, "stacked_model_baselearner_knn_optimal.pkl"))
['..\\models\\stacked_model_baselearner_knn_optimal.pkl']
1.9.2 Base Learner - Support Vector Machine ¶
Support Vector Machine (SVM) is a powerful classification algorithm that finds an optimal decision boundary — called a hyperplane — that maximizes the margin between two classes. The algorithm works by identifying the most influential data points, known as support vectors, that define this boundary. If the data is not linearly separable, SVM can use kernel functions to map it into a higher-dimensional space where separation is possible. The main advantages of SVM include strong theoretical guarantees, effectiveness in high-dimensional spaces, and robustness against overfitting when properly regularized. It performs well when the margin between classes is clear and works effectively with small to medium-sized datasets. However, SVM has notable limitations: it is computationally expensive, making it impractical for very large datasets; it requires careful tuning of hyperparameters such as the kernel type and regularization strength; and it is not easily interpretable, as decision boundaries in high-dimensional space can be difficult to visualize.
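A minimal sketch of how the margin and kernel concepts surface in scikit-learn's SVC API; the parameter values are placeholders, and the fitted attributes named in the comments become available only after calling fit.
##################################
# Illustrative SVC configuration:
# regularization strength, kernel choice,
# and inspection of the fitted support vectors
##################################
from sklearn.svm import SVC

svm_concept_sketch = SVC(
    C=1.0,                    # smaller C -> stronger regularization and a softer margin
    kernel='rbf',             # implicitly maps samples into a higher-dimensional space
    gamma='scale',            # kernel coefficient; 'scale' adapts to feature variance
    class_weight='balanced',  # reweights classes to offset the 2:1 imbalance
    random_state=987654321)
##################################
# After fitting, the margin-defining points
# can be inspected:
# svm_concept_sketch.support_vectors_ (the support vectors)
# svm_concept_sketch.n_support_ (count per class)
##################################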
- The support vector machine model from the sklearn.svm Python library API was implemented.
- The model contains 3 hyperparameters for tuning:
- C = regularization parameter (inversely proportional to the regularization strength) made to vary between 0.1 and 1.0
- kernel = kernel type to be used in the algorithm made to vary between linear and rbf
- gamma = kernel coefficient made to vary between scale and auto
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- C = 1.0
- kernel = linear
- gamma = scale
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9019
- Precision = 0.8059
- Recall = 0.8852
- F1 Score = 0.8437
- AUROC = 0.8971
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.9130
- Precision = 0.8181
- Recall = 0.9000
- F1 Score = 0.8571
- AUROC = 0.9091
- Sufficiently comparable apparent and independent validation model performance observed that might be indicative of the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_svm_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('stacked_baselearner_svm_model', SVC(class_weight='balanced',
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_svm_hyperparameter_grid = {
'stacked_baselearner_svm_model__C': [0.1, 1.0],
'stacked_baselearner_svm_model__kernel': ['linear', 'rbf'],
'stacked_baselearner_svm_model__gamma': ['scale','auto']
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_svm_grid_search = GridSearchCV(
estimator=stacked_baselearner_svm_pipeline,
param_grid=stacked_baselearner_svm_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_svm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('stacked_baselearner_svm_model', SVC(class_weight='balanced', random_state=987654321))]), n_jobs=-1, param_grid={'stacked_baselearner_svm_model__C': [0.1, 1.0], 'stacked_baselearner_svm_model__gamma': ['scale', 'auto'], 'stacked_baselearner_svm_model__kernel': ['linear', 'rbf']}, scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('stacked_baselearner_svm_model', SVC(class_weight='balanced', kernel='linear', random_state=987654321))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
SVC(class_weight='balanced', kernel='linear', random_state=987654321)
##################################
# Identifying the best model
##################################
stacked_baselearner_svm_optimal = stacked_baselearner_svm_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_svm_optimal_f1_cv = stacked_baselearner_svm_grid_search.best_score_
stacked_baselearner_svm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train))
stacked_baselearner_svm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Stacked Base Learner SVM: ')
print(f"Best Stacked Base Learner SVM Hyperparameters: {stacked_baselearner_svm_grid_search.best_params_}")
Best Stacked Base Learner SVM: 
Best Stacked Base Learner SVM Hyperparameters: {'stacked_baselearner_svm_model__C': 1.0, 'stacked_baselearner_svm_model__gamma': 'scale', 'stacked_baselearner_svm_model__kernel': 'linear'}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_svm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_svm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8219
F1 Score on Training Data: 0.8438

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0        0.95      0.91      0.93       143
         1.0        0.81      0.89      0.84        61

    accuracy                            0.90       204
   macro avg        0.88      0.90      0.89       204
weighted avg        0.91      0.90      0.90       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner SVM Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner SVM Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_svm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0        0.96      0.92      0.94        49
         1.0        0.82      0.90      0.86        20

    accuracy                            0.91        69
   macro avg        0.89      0.91      0.90        69
weighted avg        0.92      0.91      0.91        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner SVM Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner SVM Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_baselearner_svm_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train))
stacked_baselearner_svm_optimal_train['model'] = ['stacked_baselearner_svm_optimal'] * 5
stacked_baselearner_svm_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Base Learner SVM Train Performance Metrics: ')
display(stacked_baselearner_svm_optimal_train)
Optimal Stacked Base Learner SVM Train Performance Metrics:
|   | metric_name | metric_value | model | set |
|---|---|---|---|---|
| 0 | Accuracy | 0.901961 | stacked_baselearner_svm_optimal | train |
| 1 | Precision | 0.805970 | stacked_baselearner_svm_optimal | train |
| 2 | Recall | 0.885246 | stacked_baselearner_svm_optimal | train |
| 3 | F1 | 0.843750 | stacked_baselearner_svm_optimal | train |
| 4 | AUROC | 0.897168 | stacked_baselearner_svm_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_baselearner_svm_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation))
stacked_baselearner_svm_optimal_validation['model'] = ['stacked_baselearner_svm_optimal'] * 5
stacked_baselearner_svm_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Base Learner SVM Validation Performance Metrics: ')
display(stacked_baselearner_svm_optimal_validation)
Optimal Stacked Base Learner SVM Validation Performance Metrics:
|   | metric_name | metric_value | model | set |
|---|---|---|---|---|
| 0 | Accuracy | 0.913043 | stacked_baselearner_svm_optimal | validation |
| 1 | Precision | 0.818182 | stacked_baselearner_svm_optimal | validation |
| 2 | Recall | 0.900000 | stacked_baselearner_svm_optimal | validation |
| 3 | F1 | 0.857143 | stacked_baselearner_svm_optimal | validation |
| 4 | AUROC | 0.909184 | stacked_baselearner_svm_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(stacked_baselearner_svm_optimal,
os.path.join("..", MODELS_PATH, "stacked_model_baselearner_svm_optimal.pkl"))
['..\\models\\stacked_model_baselearner_svm_optimal.pkl']
1.9.3 Base Learner - Ridge Classifier ¶
Ridge Classifier is a linear classification algorithm built on ridge regression, which incorporates L2 regularization to prevent overfitting by penalizing large coefficients in the decision boundary equation. It converts the binary class labels to {-1, +1}, fits a penalized least-squares model, and assigns the predicted class based on the sign of the resulting decision function rather than an explicit probability estimate. The key steps include fitting a linear model while adding a penalty term to shrink coefficient values, which reduces variance and improves generalization. Ridge Classifier is particularly useful when dealing with collinear features, as it distributes the importance among correlated variables instead of assigning extreme weights to a few. The advantages of Ridge Classifier include its efficiency, interpretability, and ability to handle high-dimensional data with multicollinearity. However, it has limitations: it assumes a linear decision boundary, making it unsuitable for complex, non-linear relationships; it does not produce calibrated class probabilities; and the regularization parameter requires tuning to balance bias and variance effectively. Additionally, it does not perform feature selection, meaning all input features contribute to the decision-making process, which may reduce interpretability in some cases.
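To make the decision rule above concrete, the following is a minimal illustrative sketch on synthetic data (separate from the tuned pipeline below) showing that RidgeClassifier simply thresholds its decision function at zero and exposes no predict_proba, a detail that matters later when its outputs are stacked.
##################################
# Minimal illustrative sketch (synthetic data, not the project pipeline):
# RidgeClassifier encodes the binary target as {-1, +1}, fits an
# L2-penalized least-squares model, and predicts the positive class
# whenever decision_function() exceeds zero.
##################################
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
ridge_toy = RidgeClassifier(alpha=2.0, class_weight='balanced').fit(X_toy, y_toy)
scores = ridge_toy.decision_function(X_toy)   # signed distances, not probabilities
manual_pred = (scores > 0).astype(int)        # thresholding at zero
assert np.array_equal(manual_pred, ridge_toy.predict(X_toy))
# RidgeClassifier has no predict_proba(), which is why the stacking code
# later falls back to predict() when building meta-features for this learner.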
- The ridge classifier model from the sklearn.linear_model Python library API was implemented.
- The model contains 3 hyperparameters for tuning:
- alpha = regularization strength made to vary between 1.0 and 2.0
- solver = solver to use in the computational routines made to vary between sag and saga
- tol = precision of the solution made to vary between 1e-3 and 1e-4
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- alpha = 2.0
- solver = saga
- tol = 1e-4
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.8872
- Precision = 0.7638
- Recall = 0.9016
- F1 Score = 0.8270
- AUROC = 0.8913
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8985
- Precision = 0.7826
- Recall = 0.9000
- F1 Score = 0.8372
- AUROC = 0.8989
- Sufficiently comparable apparent and independent validation model performance was observed, which might be indicative of the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_rc_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('stacked_baselearner_rc_model', RidgeClassifier(class_weight='balanced',
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_rc_hyperparameter_grid = {
'stacked_baselearner_rc_model__alpha': [1.00, 2.00],
'stacked_baselearner_rc_model__solver': ['sag', 'saga'],
'stacked_baselearner_rc_model__tol': [1e-3, 1e-4]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_rc_grid_search = GridSearchCV(
estimator=stacked_baselearner_rc_pipeline,
param_grid=stacked_baselearner_rc_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_rc_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('stacked_baselearner_rc_model', RidgeClassifier(class_weight='balanced', random_state=987654321))]), n_jobs=-1, param_grid={'stacked_baselearner_rc_model__alpha': [1.0, 2.0], 'stacked_baselearner_rc_model__solver': ['sag', 'saga'], 'stacked_baselearner_rc_model__tol': [0.001, 0.0001]}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
stacked_baselearner_rc_optimal = stacked_baselearner_rc_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_rc_optimal_f1_cv = stacked_baselearner_rc_grid_search.best_score_
stacked_baselearner_rc_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train))
stacked_baselearner_rc_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Stacked Base Learner Ridge Classifier: ')
print(f"Best Stacked Base Learner Ridge Classifier Hyperparameters: {stacked_baselearner_rc_grid_search.best_params_}")
Best Stacked Base Learner Ridge Classifier: 
Best Stacked Base Learner Ridge Classifier Hyperparameters: {'stacked_baselearner_rc_model__alpha': 2.0, 'stacked_baselearner_rc_model__solver': 'saga', 'stacked_baselearner_rc_model__tol': 0.0001}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_rc_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_rc_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8097
F1 Score on Training Data: 0.8271

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0        0.95      0.88      0.92       143
         1.0        0.76      0.90      0.83        61

    accuracy                            0.89       204
   macro avg        0.86      0.89      0.87       204
weighted avg        0.90      0.89      0.89       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_rc_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8372

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0        0.96      0.90      0.93        49
         1.0        0.78      0.90      0.84        20

    accuracy                            0.90        69
   macro avg        0.87      0.90      0.88        69
weighted avg        0.91      0.90      0.90        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_baselearner_rc_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train))
stacked_baselearner_rc_optimal_train['model'] = ['stacked_baselearner_rc_optimal'] * 5
stacked_baselearner_rc_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Base Learner Ridge Classifier Train Performance Metrics: ')
display(stacked_baselearner_rc_optimal_train)
Optimal Stacked Base Learner Ridge Classifier Train Performance Metrics:
|   | metric_name | metric_value | model | set |
|---|---|---|---|---|
| 0 | Accuracy | 0.887255 | stacked_baselearner_rc_optimal | train |
| 1 | Precision | 0.763889 | stacked_baselearner_rc_optimal | train |
| 2 | Recall | 0.901639 | stacked_baselearner_rc_optimal | train |
| 3 | F1 | 0.827068 | stacked_baselearner_rc_optimal | train |
| 4 | AUROC | 0.891379 | stacked_baselearner_rc_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_baselearner_rc_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation))
stacked_baselearner_rc_optimal_validation['model'] = ['stacked_baselearner_rc_optimal'] * 5
stacked_baselearner_rc_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Base Learner Ridge Classifier Validation Performance Metrics: ')
display(stacked_baselearner_rc_optimal_validation)
Optimal Stacked Base Learner Ridge Classifier Validation Performance Metrics:
|   | metric_name | metric_value | model | set |
|---|---|---|---|---|
| 0 | Accuracy | 0.898551 | stacked_baselearner_rc_optimal | validation |
| 1 | Precision | 0.782609 | stacked_baselearner_rc_optimal | validation |
| 2 | Recall | 0.900000 | stacked_baselearner_rc_optimal | validation |
| 3 | F1 | 0.837209 | stacked_baselearner_rc_optimal | validation |
| 4 | AUROC | 0.898980 | stacked_baselearner_rc_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(stacked_baselearner_rc_optimal,
os.path.join("..", MODELS_PATH, "stacked_model_baselearner_ridge_classifier_optimal.pkl"))
['..\\models\\stacked_model_baselearner_ridge_classifier_optimal.pkl']
1.9.4 Base Learner - Neural Network ¶
Neural Network is a classification algorithm inspired by the human brain, consisting of layers of interconnected neurons that transform input features through weighted connections and activation functions. It learns patterns in data through backpropagation, where the network adjusts its internal weights to minimize classification error. The process involves an input layer receiving data, multiple hidden layers extracting hierarchical features, and an output layer producing a final prediction. The key advantages of neural networks include their ability to model highly complex, non-linear relationships, making them suitable for image, text, and speech classification tasks. They are also highly scalable, capable of handling massive datasets. However, neural networks have several challenges: they require substantial computational resources, especially for deep architectures; they need large amounts of labeled data for effective training; and they are often difficult to interpret due to their "black box" nature. Additionally, hyperparameter tuning, including choosing the number of layers, neurons, and activation functions, is non-trivial and requires careful optimization to prevent overfitting or underfitting.
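As a concrete illustration of the forward pass described above, the following minimal sketch on synthetic data (illustrative settings only, not the tuned pipeline below) reproduces MLPClassifier's predicted probabilities by hand from its learned weights: one ReLU hidden layer followed by a sigmoid output unit.
##################################
# Minimal illustrative sketch (synthetic data, illustrative settings):
# reproducing the forward pass of a single-hidden-layer MLPClassifier
# by hand using its learned weights and biases.
##################################
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
mlp_toy = MLPClassifier(hidden_layer_sizes=(50,), activation='relu',
                        solver='lbfgs', max_iter=500, random_state=0).fit(X_toy, y_toy)
hidden = np.maximum(0, X_toy @ mlp_toy.coefs_[0] + mlp_toy.intercepts_[0])  # ReLU hidden layer
logit = (hidden @ mlp_toy.coefs_[1] + mlp_toy.intercepts_[1]).ravel()       # output-layer weighted sum
manual_proba = 1.0 / (1.0 + np.exp(-logit))                                 # sigmoid -> P(class = 1)
assert np.allclose(manual_proba, mlp_toy.predict_proba(X_toy)[:, 1])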
- The neural network model from the sklearn.neural_network Python library API was implemented.
- The model contains 3 hyperparameters for tuning:
- hidden_layer_sizes = ith element represents the number of neurons in the ith hidden layer made to vary between (50,) and (100,)
- activation = activation function for the hidden layer made to vary between relu and tanh
- alpha = strength of the L2 regularization term made to vary between 0.0001 and 0.001
- No hyperparameter was defined in the model to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- hidden_layer_sizes = (50,)
- activation = relu
- alpha = 0.0001
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.8921
- Precision = 0.8095
- Recall = 0.8360
- F1 Score = 0.8225
- AUROC = 0.8760
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8840
- Precision = 0.7727
- Recall = 0.8500
- F1 Score = 0.8095
- AUROC = 0.8739
- Sufficiently comparable apparent and independent validation model performance was observed, which might be indicative of the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_nn_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('stacked_baselearner_nn_model', MLPClassifier(max_iter=500,
solver='lbfgs',
early_stopping=False,
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_nn_hyperparameter_grid = {
'stacked_baselearner_nn_model__hidden_layer_sizes': [(50,), (100,)],
'stacked_baselearner_nn_model__activation': ['relu', 'tanh'],
'stacked_baselearner_nn_model__alpha': [0.0001, 0.001]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_nn_grid_search = GridSearchCV(
estimator=stacked_baselearner_nn_pipeline,
param_grid=stacked_baselearner_nn_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_nn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('stacked_baselearner_nn_model', MLPClassifier(max_iter=500, random_state=987654321, solver='lbfgs'))]), n_jobs=-1, param_grid={'stacked_baselearner_nn_model__activation': ['relu', 'tanh'], 'stacked_baselearner_nn_model__alpha': [0.0001, 0.001], 'stacked_baselearner_nn_model__hidden_layer_sizes': [(50,), (100,)]}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
stacked_baselearner_nn_optimal = stacked_baselearner_nn_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_nn_optimal_f1_cv = stacked_baselearner_nn_grid_search.best_score_
stacked_baselearner_nn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train))
stacked_baselearner_nn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Stacked Base Learner Neural Network: ')
print(f"Best Stacked Base Learner Neural Network Hyperparameters: {stacked_baselearner_nn_grid_search.best_params_}")
Best Stacked Base Learner Neural Network: 
Best Stacked Base Learner Neural Network Hyperparameters: {'stacked_baselearner_nn_model__activation': 'relu', 'stacked_baselearner_nn_model__alpha': 0.0001, 'stacked_baselearner_nn_model__hidden_layer_sizes': (50,)}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_nn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_nn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8063
F1 Score on Training Data: 0.8226

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0        0.93      0.92      0.92       143
         1.0        0.81      0.84      0.82        61

    accuracy                            0.89       204
   macro avg        0.87      0.88      0.87       204
weighted avg        0.89      0.89      0.89       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Neural Network Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Neural Network Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_nn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8095

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0        0.94      0.90      0.92        49
         1.0        0.77      0.85      0.81        20

    accuracy                            0.88        69
   macro avg        0.85      0.87      0.86        69
weighted avg        0.89      0.88      0.89        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Neural Network Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Neural Network Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_baselearner_nn_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train))
stacked_baselearner_nn_optimal_train['model'] = ['stacked_baselearner_nn_optimal'] * 5
stacked_baselearner_nn_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Base Learner Neural Network Train Performance Metrics: ')
display(stacked_baselearner_nn_optimal_train)
Optimal Stacked Base Learner Neural Network Train Performance Metrics:
|   | metric_name | metric_value | model | set |
|---|---|---|---|---|
| 0 | Accuracy | 0.892157 | stacked_baselearner_nn_optimal | train |
| 1 | Precision | 0.809524 | stacked_baselearner_nn_optimal | train |
| 2 | Recall | 0.836066 | stacked_baselearner_nn_optimal | train |
| 3 | F1 | 0.822581 | stacked_baselearner_nn_optimal | train |
| 4 | AUROC | 0.876075 | stacked_baselearner_nn_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_baselearner_nn_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation))
stacked_baselearner_nn_optimal_validation['model'] = ['stacked_baselearner_nn_optimal'] * 5
stacked_baselearner_nn_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Base Learner Neural Network Validation Performance Metrics: ')
display(stacked_baselearner_nn_optimal_validation)
Optimal Stacked Base Learner Neural Network Validation Performance Metrics:
|   | metric_name | metric_value | model | set |
|---|---|---|---|---|
| 0 | Accuracy | 0.884058 | stacked_baselearner_nn_optimal | validation |
| 1 | Precision | 0.772727 | stacked_baselearner_nn_optimal | validation |
| 2 | Recall | 0.850000 | stacked_baselearner_nn_optimal | validation |
| 3 | F1 | 0.809524 | stacked_baselearner_nn_optimal | validation |
| 4 | AUROC | 0.873980 | stacked_baselearner_nn_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(stacked_baselearner_nn_optimal,
os.path.join("..", MODELS_PATH, "stacked_model_baselearner_neural_network_optimal.pkl"))
['..\\models\\stacked_model_baselearner_neural_network_optimal.pkl']
1.9.5 Base Learner - Decision Tree ¶
Decision Tree is a hierarchical classification model that recursively splits data based on feature values, forming a tree-like structure where each node represents a decision rule and each leaf represents a class label. The tree is built using a greedy algorithm that selects the best feature at each step based on criteria such as information gain or Gini impurity. The main advantages of decision trees include their interpretability, as the decision-making process can be easily visualized and understood, and their ability to model non-linear relationships without requiring extensive feature engineering. They also handle both numerical and categorical data well. However, decision trees are prone to overfitting, especially when deep trees are grown without pruning. Small changes in the dataset can lead to entirely different structures, making them unstable. Additionally, they tend to perform poorly on highly complex problems where relationships between variables are intricate, making ensemble methods such as Random Forest or Gradient Boosting more effective in practice.
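To illustrate the Gini impurity criterion mentioned above, the following small sketch works through a hypothetical split: the parent node counts mirror the training data (143 No, 61 Yes), while the child counts are purely illustrative.
##################################
# Hypothetical worked example of the Gini impurity criterion used to rank splits.
# A candidate split sends (100 No, 5 Yes) left and (43 No, 56 Yes) right;
# these child counts are illustrative, not taken from the actual tree.
##################################
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)
parent = gini([143, 61])                        # ~0.419
left, right = gini([100, 5]), gini([43, 56])    # ~0.091 and ~0.491
n_left, n_right, n = 105, 99, 204
weighted_children = (n_left / n) * left + (n_right / n) * right
impurity_decrease = parent - weighted_children  # larger decrease = better split
print(f"Parent Gini: {parent:.3f}, Weighted child Gini: {weighted_children:.3f}, Decrease: {impurity_decrease:.3f}")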
- The decision tree model from the sklearn.tree Python library API was implemented.
- The model contains 3 hyperparameters for tuning:
- criterion = function to measure the quality of a split made to vary between gini and entropy
- max_depth = maximum depth of the tree made to vary between 3 and 6
- min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- criterion = gini
- max_depth = 6
- min_samples_leaf = 5
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.8970
- Precision = 0.7500
- Recall = 0.9836
- F1 Score = 0.8510
- AUROC = 0.9218
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8550
- Precision = 0.6666
- Recall = 1.0000
- F1 Score = 0.8000
- AUROC = 0.8979
- Sufficiently comparable apparent and independent validation model performance was observed, which might be indicative of the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_dt_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('stacked_baselearner_dt_model', DecisionTreeClassifier(class_weight='balanced',
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_dt_hyperparameter_grid = {
'stacked_baselearner_dt_model__criterion': ['gini', 'entropy'],
'stacked_baselearner_dt_model__max_depth': [3, 6],
'stacked_baselearner_dt_model__min_samples_leaf': [5, 10]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_dt_grid_search = GridSearchCV(
estimator=stacked_baselearner_dt_pipeline,
param_grid=stacked_baselearner_dt_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_dt_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('stacked_baselearner_dt_model', DecisionTreeClassifier(class_weight='balanced', random_state=987654321))]), n_jobs=-1, param_grid={'stacked_baselearner_dt_model__criterion': ['gini', 'entropy'], 'stacked_baselearner_dt_model__max_depth': [3, 6], 'stacked_baselearner_dt_model__min_samples_leaf': [5, 10]}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
stacked_baselearner_dt_optimal = stacked_baselearner_dt_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_dt_optimal_f1_cv = stacked_baselearner_dt_grid_search.best_score_
stacked_baselearner_dt_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train))
stacked_baselearner_dt_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Stacked Base Learner Decision Trees: ')
print(f"Best Stacked Base Learner Decision Trees Hyperparameters: {stacked_baselearner_dt_grid_search.best_params_}")
Best Stacked Base Learner Decision Trees: 
Best Stacked Base Learner Decision Trees Hyperparameters: {'stacked_baselearner_dt_model__criterion': 'gini', 'stacked_baselearner_dt_model__max_depth': 6, 'stacked_baselearner_dt_model__min_samples_leaf': 5}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_dt_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_dt_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8099
F1 Score on Training Data: 0.8511

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0        0.99      0.86      0.92       143
         1.0        0.75      0.98      0.85        61

    accuracy                            0.90       204
   macro avg        0.87      0.92      0.89       204
weighted avg        0.92      0.90      0.90       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Decision Tree Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Decision Tree Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_dt_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8000

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0        1.00      0.80      0.89        49
         1.0        0.67      1.00      0.80        20

    accuracy                            0.86        69
   macro avg        0.83      0.90      0.84        69
weighted avg        0.90      0.86      0.86        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Decision Tree Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Decision Tree Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_baselearner_dt_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train))
stacked_baselearner_dt_optimal_train['model'] = ['stacked_baselearner_dt_optimal'] * 5
stacked_baselearner_dt_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Base Learner Decision Tree Train Performance Metrics: ')
display(stacked_baselearner_dt_optimal_train)
Optimal Stacked Base Learner Decision Tree Train Performance Metrics:
|   | metric_name | metric_value | model | set |
|---|---|---|---|---|
| 0 | Accuracy | 0.897059 | stacked_baselearner_dt_optimal | train |
| 1 | Precision | 0.750000 | stacked_baselearner_dt_optimal | train |
| 2 | Recall | 0.983607 | stacked_baselearner_dt_optimal | train |
| 3 | F1 | 0.851064 | stacked_baselearner_dt_optimal | train |
| 4 | AUROC | 0.921873 | stacked_baselearner_dt_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_baselearner_dt_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation))
stacked_baselearner_dt_optimal_validation['model'] = ['stacked_baselearner_dt_optimal'] * 5
stacked_baselearner_dt_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Base Learner Decision Tree Validation Performance Metrics: ')
display(stacked_baselearner_dt_optimal_validation)
Optimal Stacked Base Learner Decision Tree Validation Performance Metrics:
|   | metric_name | metric_value | model | set |
|---|---|---|---|---|
| 0 | Accuracy | 0.855072 | stacked_baselearner_dt_optimal | validation |
| 1 | Precision | 0.666667 | stacked_baselearner_dt_optimal | validation |
| 2 | Recall | 1.000000 | stacked_baselearner_dt_optimal | validation |
| 3 | F1 | 0.800000 | stacked_baselearner_dt_optimal | validation |
| 4 | AUROC | 0.897959 | stacked_baselearner_dt_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(stacked_baselearner_dt_optimal,
os.path.join("..", MODELS_PATH, "stacked_model_baselearner_decision_trees_optimal.pkl"))
['..\\models\\stacked_model_baselearner_decision_trees_optimal.pkl']
1.9.6 Meta Learner - Logistic Regression ¶
Logistic Regression is a linear classification algorithm that estimates the probability of a binary outcome using the logistic (sigmoid) function. It assumes a linear relationship between the predictor variables and the log-odds of the target class. The algorithm involves calculating a weighted sum of input features, applying the sigmoid function to transform the result into a probability, and assigning a class label based on a threshold (typically 0.5). Logistic regression is simple, interpretable, and computationally efficient, making it a popular choice for baseline models and problems where relationships between features and the target variable are approximately linear. It also provides insight into feature importance through its learned coefficients. However, logistic regression has limitations: it struggles with non-linear relationships unless feature engineering or polynomial terms are used, it is sensitive to multicollinearity among predictor variables, and it assumes independence of observations, which may not always hold in real-world data. Additionally, it may perform poorly when classes are highly imbalanced, requiring techniques such as weighting or resampling to improve predictions.
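As a concrete illustration of the weighted-sum-plus-sigmoid mechanism described above, the following minimal sketch on synthetic data (default settings, separate from the meta learner fitted below) reproduces LogisticRegression's probabilities and 0.5-threshold predictions by hand.
##################################
# Minimal illustrative sketch (synthetic data, default settings):
# logistic regression converts a weighted sum of features into a probability
# via the sigmoid function, then thresholds at 0.5.
##################################
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
logreg_toy = LogisticRegression(class_weight='balanced').fit(X_toy, y_toy)
z = X_toy @ logreg_toy.coef_.ravel() + logreg_toy.intercept_[0]  # weighted sum (log-odds)
p = 1.0 / (1.0 + np.exp(-z))                                     # sigmoid -> P(class = 1)
manual_pred = (p >= 0.5).astype(int)                             # default 0.5 decision threshold
assert np.allclose(p, logreg_toy.predict_proba(X_toy)[:, 1])
assert np.array_equal(manual_pred, logreg_toy.predict(X_toy))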
- The logistic regression model from the sklearn.linear_model Python library API was implemented.
- The model contains 3 fixed hyperparameters:
- C = inverse of regularization strength held constant at a value of 1.0
- penalty = penalty norm held constant at a value of l2
- solver = algorithm used in the optimization problem held constant at a value of lbfgs
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9068
- Precision = 0.8088
- Recall = 0.9016
- F1 Score = 0.8527
- AUROC = 0.9053
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.9130
- Precision = 0.8181
- Recall = 0.9000
- F1 Score = 0.8571
- AUROC = 0.9091
- Sufficiently comparable apparent and independent validation model performance was observed, which might be indicative of the absence of excessive model overfitting.
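The cells below construct the out-of-fold meta-features manually from the saved base learners. For reference only, the same idea can be expressed with scikit-learn's built-in StackingClassifier; the sketch below is an outline with untuned, illustrative stand-ins for the project's optimized base learner pipelines and is not the approach actually used in this analysis.
##################################
# Reference-only sketch (not the approach used in this analysis):
# StackingClassifier generates out-of-fold base learner predictions via
# internal cross-validation and fits the meta learner on them. The estimators
# here are untuned stand-ins; in practice each would be the corresponding
# preprocessed base learner pipeline.
##################################
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
stacking_sketch = StackingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier()),
        ('svm', SVC(class_weight='balanced', random_state=987654321)),
        ('ridge_classifier', RidgeClassifier(class_weight='balanced', random_state=987654321)),
        ('neural_network', MLPClassifier(random_state=987654321)),
        ('decision_trees', DecisionTreeClassifier(class_weight='balanced', random_state=987654321))
    ],
    final_estimator=LogisticRegression(class_weight='balanced', random_state=987654321),
    cv=KFold(n_splits=5, shuffle=True, random_state=987654321),
    stack_method='auto',  # predict_proba where available, otherwise decision_function or predict
    n_jobs=-1)
# stacking_sketch.fit(...) would need to be called on encoded (numeric) features,
# since the raw training frame still contains categorical string columns.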
##################################
# Defining the stacking strategy (5-fold CV)
##################################
stacking_strategy = KFold(n_splits=5,
shuffle=True,
random_state=987654321)
##################################
# Loading the pre-trained base learners
# from the previously saved pickle files
##################################
stacked_baselearners = {}
stacked_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
for name in stacked_baselearner_model:
stacked_baselearner_model_path = (os.path.join("..", MODELS_PATH, f"stacked_model_baselearner_{name}_optimal.pkl"))
stacked_baselearners[name] = joblib.load(stacked_baselearner_model_path)
##################################
# Initializing the meta-feature matrices
##################################
meta_train_stacked = np.zeros((X_preprocessed_train.shape[0], len(stacked_baselearners)))
meta_validation_stacked = np.zeros((X_preprocessed_validation.shape[0], len(stacked_baselearners)))
##################################
# Generating out-of-fold predictions for training the meta learner
##################################
for i, (name, model) in enumerate(stacked_baselearners.items()):
oof_preds = np.zeros(X_preprocessed_train.shape[0])
validation_fold_preds = np.zeros((X_preprocessed_validation.shape[0], stacking_strategy.get_n_splits()))
for j, (train_idx, val_idx) in enumerate(stacking_strategy.split(X_preprocessed_train)):
model.fit(X_preprocessed_train.iloc[train_idx], y_preprocessed_train_encoded[train_idx])
oof_preds[val_idx] = model.predict_proba(X_preprocessed_train.iloc[val_idx])[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_train.iloc[val_idx])
validation_fold_preds[:, j] = model.predict_proba(X_preprocessed_validation)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_validation)
# Extracting the meta-feature matrix for the train data
meta_train_stacked[:, i] = oof_preds
# Extracting the meta-feature matrix for the validation data
# Averaging the validation predictions across folds
meta_validation_stacked[:, i] = validation_fold_preds.mean(axis=1)
##################################
# Training the meta learner on the stacked features
##################################
stacked_metalearner_lr_optimal = LogisticRegression(class_weight='balanced',
penalty='l2',
C=1.0,
solver='lbfgs',
random_state=987654321)
stacked_metalearner_lr_optimal.fit(meta_train_stacked, y_preprocessed_train_encoded)
LogisticRegression(class_weight='balanced', random_state=987654321)
##################################
# Saving the meta learner model
# developed from the meta-train data
##################################
joblib.dump(stacked_metalearner_lr_optimal,
os.path.join("..", MODELS_PATH, "stacked_model_metalearner_logistic_regression_optimal.pkl"))
['..\\models\\stacked_model_metalearner_logistic_regression_optimal.pkl']
##################################
# Creating a function to extract the
# meta-feature matrices for new data
##################################
def extract_stacked_metafeature_matrix(X_preprocessed_new):
    ##################################
    # Loading the pre-trained base learners
    # from the previously saved pickle files
    ##################################
    stacked_baselearners = {}
    stacked_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
    for name in stacked_baselearner_model:
        stacked_baselearner_model_path = os.path.join("..", MODELS_PATH, f"stacked_model_baselearner_{name}_optimal.pkl")
        stacked_baselearners[name] = joblib.load(stacked_baselearner_model_path)
    ##################################
    # Initializing the meta-feature matrices
    # for the train and new data
    ##################################
    meta_train_stacked = np.zeros((X_preprocessed_train.shape[0], len(stacked_baselearners)))
    meta_new_stacked = np.zeros((X_preprocessed_new.shape[0], len(stacked_baselearners)))
    ##################################
    # Refitting each base learner on the training folds
    # (mirroring the training-time scheme) and averaging
    # its predictions for the new data across folds
    ##################################
    for i, (name, model) in enumerate(stacked_baselearners.items()):
        oof_preds = np.zeros(X_preprocessed_train.shape[0])
        new_fold_preds = np.zeros((X_preprocessed_new.shape[0], stacking_strategy.get_n_splits()))
        for j, (train_idx, val_idx) in enumerate(stacking_strategy.split(X_preprocessed_train)):
            model.fit(X_preprocessed_train.iloc[train_idx], y_preprocessed_train_encoded[train_idx])
            oof_preds[val_idx] = model.predict_proba(X_preprocessed_train.iloc[val_idx])[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_train.iloc[val_idx])
            new_fold_preds[:, j] = model.predict_proba(X_preprocessed_new)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_new)
        # Out-of-fold predictions on the train data (retained for reference)
        meta_train_stacked[:, i] = oof_preds
        # Meta-feature column for the new data: fold-averaged predictions
        meta_new_stacked[:, i] = new_fold_preds.mean(axis=1)
    return meta_new_stacked
##################################
# Evaluating the F1 scores
# on the training and validation data
##################################
stacked_metalearner_lr_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train)))
stacked_metalearner_lr_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation)))
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training data
# to assess overfitting optimism
##################################
print(f"F1 Score on Training Data: {stacked_metalearner_lr_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train))))
F1 Score on Training Data: 0.8527

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.91      0.93       143
         1.0       0.81      0.90      0.85        61

    accuracy                           0.91       204
   macro avg       0.88      0.91      0.89       204
weighted avg       0.91      0.91      0.91       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train)))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validationing Data: {stacked_metalearner_lr_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation))))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation)))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
stacked_metalearner_lr_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train)))
stacked_metalearner_lr_optimal_train['model'] = ['stacked_metalearner_lr_optimal'] * 5
stacked_metalearner_lr_optimal_train['set'] = ['train'] * 5
print('Optimal Stacked Meta Learner Logistic Regression Train Performance Metrics: ')
display(stacked_metalearner_lr_optimal_train)
Optimal Stacked Meta Learner Logistic Regression Train Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.906863 | stacked_metalearner_lr_optimal | train |
1 | Precision | 0.808824 | stacked_metalearner_lr_optimal | train |
2 | Recall | 0.901639 | stacked_metalearner_lr_optimal | train |
3 | F1 | 0.852713 | stacked_metalearner_lr_optimal | train |
4 | AUROC | 0.905365 | stacked_metalearner_lr_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
stacked_metalearner_lr_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation)))
stacked_metalearner_lr_optimal_validation['model'] = ['stacked_metalearner_lr_optimal'] * 5
stacked_metalearner_lr_optimal_validation['set'] = ['validation'] * 5
print('Optimal Stacked Meta Learner Logistic Regression Validation Performance Metrics: ')
display(stacked_metalearner_lr_optimal_validation)
Optimal Stacked Meta Learner Logistic Regression Validation Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.913043 | stacked_metalearner_lr_optimal | validation |
1 | Precision | 0.818182 | stacked_metalearner_lr_optimal | validation |
2 | Recall | 0.900000 | stacked_metalearner_lr_optimal | validation |
3 | F1 | 0.857143 | stacked_metalearner_lr_optimal | validation |
4 | AUROC | 0.909184 | stacked_metalearner_lr_optimal | validation |
1.10. Blended Model Development ¶
Blending is an ensemble technique that enhances classification accuracy by training a meta-model on a holdout validation set, rather than using out-of-fold predictions like stacking. This simplifies implementation while maintaining the benefits of combining multiple base models. The process of blending starts by training base models on the full training dataset. Instead of applying cross-validation to obtain out-of-fold predictions, blending reserves a small portion of the training data as a holdout set. The base models make predictions on this unseen holdout set, and these predictions are then used as input features for a meta-model, which learns how to optimally combine them into a final classification decision. Since the meta-model is trained on predictions from unseen data, it avoids the risk of overfitting that can sometimes occur when base models are evaluated on the same data they were trained on. Blending is motivated by its simplicity and ease of implementation compared to stacking, as it eliminates the need for repeated k-fold cross-validation to generate training data for the meta-model. However, one drawback is that the meta-model has access to fewer training examples, as a portion of the data is withheld for validation rather than being used for training. This can limit the generalization ability of the final model, especially if the holdout set is too small. Despite this limitation, blending remains a useful approach in applications where a quick and effective ensemble method is needed without the computational overhead of stacking.
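The workflow described above can be summarized in a minimal sketch, assuming generic placeholders X_train and y_train for the training data and a base_learners dictionary of instantiated models; all names here are illustrative only and are not reused by the pipeline code that follows.

##################################
# Minimal blending sketch (illustrative only):
# base learners fit on a training split,
# meta-model fit on their holdout predictions
##################################
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Reserving a holdout set from the training data for the meta-model
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=987654321)

# Fitting each base learner on the reduced training split
# and collecting its predictions on the unseen holdout set
holdout_meta_features = np.zeros((X_holdout.shape[0], len(base_learners)))
for i, (name, model) in enumerate(base_learners.items()):
    model.fit(X_fit, y_fit)
    holdout_meta_features[:, i] = (model.predict_proba(X_holdout)[:, 1]
                                   if hasattr(model, "predict_proba")
                                   else model.predict(X_holdout))

# Training the meta-model on the holdout predictions only
meta_model = LogisticRegression(random_state=987654321)
meta_model.fit(holdout_meta_features, y_holdout)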
1.10.1 Base Learner - K-Nearest Neighbors ¶
K-Nearest Neighbors (KNN) is a non-parametric classification algorithm that makes predictions based on the majority class among the k-nearest training samples in feature space. It does not create an explicit model during training; instead, it stores the entire dataset and computes distances between a query point and all training samples during inference. The algorithm follows three key steps: (1) compute the distance between the query point and all training samples (typically using Euclidean distance), (2) identify the k closest points, and (3) assign the most common class among them as the predicted label. KNN is advantageous because it is simple, requires minimal training time, and can model complex decision boundaries when provided with sufficient data. However, it has significant drawbacks: it is computationally expensive for large datasets since distances must be computed for every prediction, it is sensitive to irrelevant or redundant features, and it requires careful selection of k, as a small k can make the model too sensitive to noise while a large k can overly smooth decision boundaries.
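The three steps above can be traced in a minimal NumPy sketch, assuming a small numeric feature matrix X_train, a NumPy label array y_train, and a single query point; the function name and arguments are illustrative only.

##################################
# Minimal KNN prediction sketch (illustrative only):
# distance computation, neighbor selection, majority vote
##################################
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # (1) Euclidean distance from the query point to every training sample
    distances = np.linalg.norm(X_train - query, axis=1)
    # (2) Indices of the k closest training samples
    nearest_idx = np.argsort(distances)[:k]
    # (3) Majority vote among the k nearest labels
    return Counter(y_train[nearest_idx]).most_common(1)[0][0]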
- The k-nearest neighbors model from the sklearn.neighbors Python library API was implemented.
- The model contains 3 hyperparameters for tuning:
- n_neighbors = number of neighbors to use made to vary between 3 and 5
- weights = weight function used in prediction made to vary between uniform and distance
- metric = metric to use for distance computation made to vary between minkowski and euclidean
- No hyperparameter was defined in the model to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- n_neighbors = 3
- weights = uniform
- metric = minkowski
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9215
- Precision = 0.9090
- Recall = 0.8196
- F1 Score = 0.8620
- AUROC = 0.8923
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8115
- Precision = 0.7058
- Recall = 0.6000
- F1 Score = 0.6486
- AUROC = 0.7489
- Relatively large difference in apparent and independent validation model performance observed that might be indicative of the presence of moderate model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_knn_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('blended_baselearner_knn_model', KNeighborsClassifier())
])
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_knn_hyperparameter_grid = {
'blended_baselearner_knn_model__n_neighbors': [3, 5],
'blended_baselearner_knn_model__weights': ['uniform', 'distance'],
'blended_baselearner_knn_model__metric': ['minkowski', 'euclidean']
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_knn_grid_search = GridSearchCV(
estimator=blended_baselearner_knn_pipeline,
param_grid=blended_baselearner_knn_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_knn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('blended_baselearner_knn_model', KNeighborsClassifier())]), n_jobs=-1, param_grid={'blended_baselearner_knn_model__metric': ['minkowski', 'euclidean'], 'blended_baselearner_knn_model__n_neighbors': [3, 5], 'blended_baselearner_knn_model__weights': ['uniform', 'distance']}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
blended_baselearner_knn_optimal = blended_baselearner_knn_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_knn_optimal_f1_cv = blended_baselearner_knn_grid_search.best_score_
blended_baselearner_knn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train))
blended_baselearner_knn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner KNN: ')
print(f"Best Blended Base Learner KNN Hyperparameters: {blended_baselearner_knn_grid_search.best_params_}")
Best Blended Base Learner KNN:
Best Blended Base Learner KNN Hyperparameters: {'blended_baselearner_knn_model__metric': 'minkowski', 'blended_baselearner_knn_model__n_neighbors': 3, 'blended_baselearner_knn_model__weights': 'uniform'}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_knn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_knn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.6417
F1 Score on Training Data: 0.8621

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.93      0.97      0.95       143
         1.0       0.91      0.82      0.86        61

    accuracy                           0.92       204
   macro avg       0.92      0.89      0.90       204
weighted avg       0.92      0.92      0.92       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner KNN Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner KNN Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_knn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.6486

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.85      0.90      0.87        49
         1.0       0.71      0.60      0.65        20

    accuracy                           0.81        69
   macro avg       0.78      0.75      0.76        69
weighted avg       0.81      0.81      0.81        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner KNN Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner KNN Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_baselearner_knn_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train))
blended_baselearner_knn_optimal_train['model'] = ['blended_baselearner_knn_optimal'] * 5
blended_baselearner_knn_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Base Learner KNN Train Performance Metrics: ')
display(blended_baselearner_knn_optimal_train)
Optimal Blended Base Learner KNN Train Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.921569 | blended_baselearner_knn_optimal | train |
1 | Precision | 0.909091 | blended_baselearner_knn_optimal | train |
2 | Recall | 0.819672 | blended_baselearner_knn_optimal | train |
3 | F1 | 0.862069 | blended_baselearner_knn_optimal | train |
4 | AUROC | 0.892354 | blended_baselearner_knn_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_baselearner_knn_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation))
blended_baselearner_knn_optimal_validation['model'] = ['blended_baselearner_knn_optimal'] * 5
blended_baselearner_knn_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Base Learner KNN Validation Performance Metrics: ')
display(blended_baselearner_knn_optimal_validation)
Optimal Blended Base Learner KNN Validation Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.811594 | blended_baselearner_knn_optimal | validation |
1 | Precision | 0.705882 | blended_baselearner_knn_optimal | validation |
2 | Recall | 0.600000 | blended_baselearner_knn_optimal | validation |
3 | F1 | 0.648649 | blended_baselearner_knn_optimal | validation |
4 | AUROC | 0.748980 | blended_baselearner_knn_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(blended_baselearner_knn_optimal,
os.path.join("..", MODELS_PATH, "blended_model_baselearner_knn_optimal.pkl"))
['..\\models\\blended_model_baselearner_knn_optimal.pkl']
1.10.2 Base Learner - Support Vector Machine ¶
Support Vector Machine (SVM) is a powerful classification algorithm that finds an optimal decision boundary — called a hyperplane — that maximizes the margin between two classes. The algorithm works by identifying the most influential data points, known as support vectors, that define this boundary. If the data is not linearly separable, SVM can use kernel functions to map it into a higher-dimensional space where separation is possible. The main advantages of SVM include strong theoretical guarantees, effectiveness in high-dimensional spaces, and robustness against overfitting when properly regularized. It performs well when the margin between classes is clear and works effectively with small to medium-sized datasets. However, SVM has notable limitations: it is computationally expensive, making it impractical for very large datasets; it requires careful tuning of hyperparameters such as the kernel type and regularization strength; and it is not easily interpretable, as decision boundaries in high-dimensional space can be difficult to visualize.
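As a minimal sketch on a toy dataset (all names below are illustrative only), a fitted linear SVC exposes the support vectors and the hyperplane coefficients described above.

##################################
# Minimal linear SVM sketch (illustrative only):
# inspecting support vectors and the separating hyperplane
##################################
import numpy as np
from sklearn.svm import SVC

X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1])

svm_toy = SVC(kernel='linear', C=1.0)
svm_toy.fit(X_toy, y_toy)

# Support vectors are the boundary-defining points of the maximum-margin hyperplane w.x + b = 0
print(svm_toy.support_vectors_)           # influential training points
print(svm_toy.coef_, svm_toy.intercept_)  # hyperplane weights w and bias b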
- The support vector machine model from the sklearn.svm Python library API was implemented.
- The model contains 3 hyperparameters for tuning:
- C = inverse of regularization strength made to vary between 0.1 and 1.0
- kernel = kernel type to be used in the algorithm made to vary between linear and rbf
- gamma = kernel coefficient made to vary between scale and auto
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- C = 1.0
- kernel = linear
- gamma = scale
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9019
- Precision = 0.8059
- Recall = 0.8852
- F1 Score = 0.8437
- AUROC = 0.8971
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.9130
- Precision = 0.8181
- Recall = 0.9000
- F1 Score = 0.8571
- AUROC = 0.9091
- Sufficiently comparable apparent and independent validation model performance observed that might be indicative of the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_svm_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('blended_baselearner_svm_model', SVC(class_weight='balanced',
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_svm_hyperparameter_grid = {
'blended_baselearner_svm_model__C': [0.1, 1.0],
'blended_baselearner_svm_model__kernel': ['linear', 'rbf'],
'blended_baselearner_svm_model__gamma': ['scale','auto']
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_svm_grid_search = GridSearchCV(
estimator=blended_baselearner_svm_pipeline,
param_grid=blended_baselearner_svm_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_svm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('blended_baselearner_svm_model', SVC(class_weight='balanced', random_state=987654321))]), n_jobs=-1, param_grid={'blended_baselearner_svm_model__C': [0.1, 1.0], 'blended_baselearner_svm_model__gamma': ['scale', 'auto'], 'blended_baselearner_svm_model__kernel': ['linear', 'rbf']}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
blended_baselearner_svm_optimal = blended_baselearner_svm_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_svm_optimal_f1_cv = blended_baselearner_svm_grid_search.best_score_
blended_baselearner_svm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train))
blended_baselearner_svm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner SVM: ')
print(f"Best Blended Base Learner SVM Hyperparameters: {blended_baselearner_svm_grid_search.best_params_}")
Best Blended Base Learner SVM:
Best Blended Base Learner SVM Hyperparameters: {'blended_baselearner_svm_model__C': 1.0, 'blended_baselearner_svm_model__gamma': 'scale', 'blended_baselearner_svm_model__kernel': 'linear'}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_svm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_svm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8219
F1 Score on Training Data: 0.8438

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.91      0.93       143
         1.0       0.81      0.89      0.84        61

    accuracy                           0.90       204
   macro avg       0.88      0.90      0.89       204
weighted avg       0.91      0.90      0.90       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner SVM Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner SVM Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_svm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8571

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        49
         1.0       0.82      0.90      0.86        20

    accuracy                           0.91        69
   macro avg       0.89      0.91      0.90        69
weighted avg       0.92      0.91      0.91        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner SVM Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner SVM Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_baselearner_svm_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train))
blended_baselearner_svm_optimal_train['model'] = ['blended_baselearner_svm_optimal'] * 5
blended_baselearner_svm_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Base Learner SVM Train Performance Metrics: ')
display(blended_baselearner_svm_optimal_train)
Optimal Blended Base Learner SVM Train Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.901961 | blended_baselearner_svm_optimal | train |
1 | Precision | 0.805970 | blended_baselearner_svm_optimal | train |
2 | Recall | 0.885246 | blended_baselearner_svm_optimal | train |
3 | F1 | 0.843750 | blended_baselearner_svm_optimal | train |
4 | AUROC | 0.897168 | blended_baselearner_svm_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_baselearner_svm_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation))
blended_baselearner_svm_optimal_validation['model'] = ['blended_baselearner_svm_optimal'] * 5
blended_baselearner_svm_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Base Learner SVM Validation Performance Metrics: ')
display(blended_baselearner_svm_optimal_validation)
Optimal Blended Base Learner SVM Validation Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.913043 | blended_baselearner_svm_optimal | validation |
1 | Precision | 0.818182 | blended_baselearner_svm_optimal | validation |
2 | Recall | 0.900000 | blended_baselearner_svm_optimal | validation |
3 | F1 | 0.857143 | blended_baselearner_svm_optimal | validation |
4 | AUROC | 0.909184 | blended_baselearner_svm_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(blended_baselearner_svm_optimal,
os.path.join("..", MODELS_PATH, "blended_model_baselearner_svm_optimal.pkl"))
['..\\models\\blended_model_baselearner_svm_optimal.pkl']
1.10.3 Base Learner - Ridge Classifier ¶
Ridge Classifier is a linear classification model that incorporates L2 regularization to prevent overfitting by penalizing large coefficients in the decision boundary equation. It assumes a linear relationship between the predictor variables and the target class; rather than estimating class probabilities with a logistic function, it encodes the class labels as {-1, +1} targets, fits a regularized least-squares (ridge) regression to them, and assigns the predicted class according to the sign of the fitted linear output. The key steps include fitting a linear model while adding a penalty term to shrink coefficient values, which reduces variance and improves generalization. Ridge Classifier is particularly useful when dealing with collinear features, as it distributes the importance among correlated variables instead of assigning extreme weights to a few. The advantages of Ridge Classifier include its efficiency, interpretability, and ability to handle high-dimensional data with multicollinearity. However, it has limitations: it assumes a linear decision boundary, making it unsuitable for complex, non-linear relationships, and the regularization parameter requires tuning to balance bias and variance effectively. Additionally, it does not perform feature selection, meaning all input features contribute to the decision-making process, which may reduce interpretability in some cases.
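The coefficient-shrinkage effect of the L2 penalty can be seen in a minimal sketch on two nearly collinear toy features (all names below are illustrative only); larger alpha values pull the coefficients toward zero and spread the weight more evenly across the correlated predictors.

##################################
# Minimal ridge classifier sketch (illustrative only):
# L2 regularization shrinking coefficients on correlated features
##################################
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(987654321)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # nearly collinear with x1
X_toy = np.column_stack([x1, x2])
y_toy = (x1 + x2 > 0).astype(int)

for alpha in [0.01, 1.0, 100.0]:
    rc = RidgeClassifier(alpha=alpha).fit(X_toy, y_toy)
    # Larger alpha -> smaller, more evenly distributed coefficients
    print(alpha, rc.coef_.ravel())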
- The ridge classifier model from the sklearn.linear_model Python library API was implemented.
- The model contains 3 hyperparameters for tuning:
- alpha = regularization strength made to vary between 1.0 and 2.0
- solver = solver to use in the computational routines made to vary between sag and saga
- tol = precision of the solution made to vary between 1e-3 and 1e-4
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- alpha = 2.0
- solver = saga
- tol = 1e-4
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.8872
- Precision = 0.7638
- Recall = 0.9016
- F1 Score = 0.8270
- AUROC = 0.8913
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8985
- Precision = 0.7826
- Recall = 0.9000
- F1 Score = 0.8372
- AUROC = 0.8989
- Sufficiently comparable apparent and independent validation model performance observed that might be indicative of the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_rc_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('blended_baselearner_rc_model', RidgeClassifier(class_weight='balanced',
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_rc_hyperparameter_grid = {
'blended_baselearner_rc_model__alpha': [1.00, 2.00],
'blended_baselearner_rc_model__solver': ['sag', 'saga'],
'blended_baselearner_rc_model__tol': [1e-3, 1e-4]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_rc_grid_search = GridSearchCV(
estimator=blended_baselearner_rc_pipeline,
param_grid=blended_baselearner_rc_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_rc_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('blended_baselearner_rc_model', RidgeClassifier(class_weight='balanced', random_state=987654321))]), n_jobs=-1, param_grid={'blended_baselearner_rc_model__alpha': [1.0, 2.0], 'blended_baselearner_rc_model__solver': ['sag', 'saga'], 'blended_baselearner_rc_model__tol': [0.001, 0.0001]}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
blended_baselearner_rc_optimal = blended_baselearner_rc_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_rc_optimal_f1_cv = blended_baselearner_rc_grid_search.best_score_
blended_baselearner_rc_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train))
blended_baselearner_rc_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner Ridge Classifier: ')
print(f"Best Blended Base Learner Ridge Classifier Hyperparameters: {blended_baselearner_rc_grid_search.best_params_}")
Best Blended Base Learner Ridge Classifier:
Best Blended Base Learner Ridge Classifier Hyperparameters: {'blended_baselearner_rc_model__alpha': 2.0, 'blended_baselearner_rc_model__solver': 'saga', 'blended_baselearner_rc_model__tol': 0.0001}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_rc_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_rc_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8097
F1 Score on Training Data: 0.8271

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.95      0.88      0.92       143
         1.0       0.76      0.90      0.83        61

    accuracy                           0.89       204
   macro avg       0.86      0.89      0.87       204
weighted avg       0.90      0.89      0.89       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_rc_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8372

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.96      0.90      0.93        49
         1.0       0.78      0.90      0.84        20

    accuracy                           0.90        69
   macro avg       0.87      0.90      0.88        69
weighted avg       0.91      0.90      0.90        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_baselearner_rc_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train))
blended_baselearner_rc_optimal_train['model'] = ['blended_baselearner_rc_optimal'] * 5
blended_baselearner_rc_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Base Learner Ridge Classifier Train Performance Metrics: ')
display(blended_baselearner_rc_optimal_train)
Optimal Blended Base Learner Ridge Classifier Train Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.887255 | blended_baselearner_rc_optimal | train |
1 | Precision | 0.763889 | blended_baselearner_rc_optimal | train |
2 | Recall | 0.901639 | blended_baselearner_rc_optimal | train |
3 | F1 | 0.827068 | blended_baselearner_rc_optimal | train |
4 | AUROC | 0.891379 | blended_baselearner_rc_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_baselearner_rc_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation))
blended_baselearner_rc_optimal_validation['model'] = ['blended_baselearner_rc_optimal'] * 5
blended_baselearner_rc_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Base Learner Ridge Classifier Validation Performance Metrics: ')
display(blended_baselearner_rc_optimal_validation)
Optimal Blended Base Learner Ridge Classifier Validation Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.898551 | blended_baselearner_rc_optimal | validation |
1 | Precision | 0.782609 | blended_baselearner_rc_optimal | validation |
2 | Recall | 0.900000 | blended_baselearner_rc_optimal | validation |
3 | F1 | 0.837209 | blended_baselearner_rc_optimal | validation |
4 | AUROC | 0.898980 | blended_baselearner_rc_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(blended_baselearner_rc_optimal,
os.path.join("..", MODELS_PATH, "blended_model_baselearner_ridge_classifier_optimal.pkl"))
['..\\models\\blended_model_baselearner_ridge_classifier_optimal.pkl']
1.10.4 Base Learner - Neural Network ¶
Neural Network is a classification algorithm inspired by the human brain, consisting of layers of interconnected neurons that transform input features through weighted connections and activation functions. It learns patterns in data through backpropagation, where the network adjusts its internal weights to minimize classification error. The process involves an input layer receiving data, multiple hidden layers extracting hierarchical features, and an output layer producing a final prediction. The key advantages of neural networks include their ability to model highly complex, non-linear relationships, making them suitable for image, text, and speech classification tasks. They are also highly scalable, capable of handling massive datasets. However, neural networks have several challenges: they require substantial computational resources, especially for deep architectures; they need large amounts of labeled data for effective training; and they are often difficult to interpret due to their "black box" nature. Additionally, hyperparameter tuning, including choosing the number of layers, neurons, and activation functions, is non-trivial and requires careful optimization to prevent overfitting or underfitting.
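A minimal forward-pass sketch with one hidden layer (toy weights only; all names are illustrative) shows how an input is transformed through weighted connections and activation functions into a class probability.

##################################
# Minimal neural network forward-pass sketch (illustrative only):
# one hidden layer with ReLU, logistic output for binary classification
##################################
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    # Hidden layer: weighted sum followed by ReLU activation
    hidden = np.maximum(0.0, x @ W1 + b1)
    # Output layer: weighted sum followed by logistic (sigmoid) activation
    logit = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logit))   # predicted probability of the positive class

# Toy weights for a 3-feature input and a 4-neuron hidden layer
rng = np.random.default_rng(987654321)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
print(mlp_forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2))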
- The neural network model from the sklearn.neural_network Python library API was implemented.
- The model contains 3 hyperparameters for tuning:
- hidden_layer_sizes = ith element represents the number of neurons in the ith hidden layer made to vary between (50,) and (100,)
- activation = activation function for the hidden layer made to vary between relu and tanh
- alpha = strength of the L2 regularization term made to vary between 0.0001 and 0.001
- No hyperparameter was defined in the model to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- hidden_layer_sizes = (50,)
- activation = relu
- alpha = 0.0001
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.8921
- Precision = 0.8095
- Recall = 0.8360
- F1 Score = 0.8225
- AUROC = 0.8760
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8840
- Precision = 0.7727
- Recall = 0.8500
- F1 Score = 0.8095
- AUROC = 0.8739
- Sufficiently comparable apparent and independent validation model performance observed that might be indicative of the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_nn_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('blended_baselearner_nn_model', MLPClassifier(max_iter=500,
solver='lbfgs',
early_stopping=False,
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_nn_hyperparameter_grid = {
'blended_baselearner_nn_model__hidden_layer_sizes': [(50,), (100,)],
'blended_baselearner_nn_model__activation': ['relu', 'tanh'],
'blended_baselearner_nn_model__alpha': [0.0001, 0.001]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_nn_grid_search = GridSearchCV(
estimator=blended_baselearner_nn_pipeline,
param_grid=blended_baselearner_nn_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_nn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('blended_baselearner_nn_model', MLPClassifier(max_iter=500, random_state=987654321, solver='lbfgs'))]), n_jobs=-1, param_grid={'blended_baselearner_nn_model__activation': ['relu', 'tanh'], 'blended_baselearner_nn_model__alpha': [0.0001, 0.001], 'blended_baselearner_nn_model__hidden_layer_sizes': [(50,), (100,)]}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
blended_baselearner_nn_optimal = blended_baselearner_nn_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_nn_optimal_f1_cv = blended_baselearner_nn_grid_search.best_score_
blended_baselearner_nn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train))
blended_baselearner_nn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner Neural Network: ')
print(f"Best Blended Base Learner Neural Network Hyperparameters: {blended_baselearner_nn_grid_search.best_params_}")
Best Blended Base Learner Neural Network:
Best Blended Base Learner Neural Network Hyperparameters: {'blended_baselearner_nn_model__activation': 'relu', 'blended_baselearner_nn_model__alpha': 0.0001, 'blended_baselearner_nn_model__hidden_layer_sizes': (50,)}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_nn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_nn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8063
F1 Score on Training Data: 0.8226

Classification Report on Train Data:
               precision    recall  f1-score   support
         0.0        0.93      0.92      0.92       143
         1.0        0.81      0.84      0.82        61
    accuracy                            0.89       204
   macro avg        0.87      0.88      0.87       204
weighted avg        0.89      0.89      0.89       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Neural Network Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Neural Network Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_nn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8095

Classification Report on Validation Data:
               precision    recall  f1-score   support
         0.0        0.94      0.90      0.92        49
         1.0        0.77      0.85      0.81        20
    accuracy                            0.88        69
   macro avg        0.85      0.87      0.86        69
weighted avg        0.89      0.88      0.89        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Neural Network Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Neural Network Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_baselearner_nn_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train))
blended_baselearner_nn_optimal_train['model'] = ['blended_baselearner_nn_optimal'] * 5
blended_baselearner_nn_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Base Learner Neural Network Train Performance Metrics: ')
display(blended_baselearner_nn_optimal_train)
Optimal Blended Base Learner Neural Network Train Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.892157 | blended_baselearner_nn_optimal | train |
1 | Precision | 0.809524 | blended_baselearner_nn_optimal | train |
2 | Recall | 0.836066 | blended_baselearner_nn_optimal | train |
3 | F1 | 0.822581 | blended_baselearner_nn_optimal | train |
4 | AUROC | 0.876075 | blended_baselearner_nn_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_baselearner_nn_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation))
blended_baselearner_nn_optimal_validation['model'] = ['blended_baselearner_nn_optimal'] * 5
blended_baselearner_nn_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Base Learner Neural Network Validation Performance Metrics: ')
display(blended_baselearner_nn_optimal_validation)
Optimal Blended Base Learner Neural Network Validation Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.884058 | blended_baselearner_nn_optimal | validation |
1 | Precision | 0.772727 | blended_baselearner_nn_optimal | validation |
2 | Recall | 0.850000 | blended_baselearner_nn_optimal | validation |
3 | F1 | 0.809524 | blended_baselearner_nn_optimal | validation |
4 | AUROC | 0.873980 | blended_baselearner_nn_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(blended_baselearner_nn_optimal,
os.path.join("..", MODELS_PATH, "blended_model_baselearner_neural_network_optimal.pkl"))
['..\\models\\blended_model_baselearner_neural_network_optimal.pkl']
1.10.5 Base Learner - Decision Tree ¶
Decision Tree is a hierarchical classification model that recursively splits data based on feature values, forming a tree-like structure where each node represents a decision rule and each leaf represents a class label. The tree is built using a greedy algorithm that selects the best feature at each step based on criteria such as information gain or Gini impurity. The main advantages of decision trees include their interpretability, as the decision-making process can be easily visualized and understood, and their ability to model non-linear relationships without requiring extensive feature engineering. They also handle both numerical and categorical data well. However, decision trees are prone to overfitting, especially when deep trees are grown without pruning. Small changes in the dataset can lead to entirely different structures, making them unstable. Additionally, they tend to perform poorly on highly complex problems where relationships between variables are intricate, making ensemble methods such as Random Forest or Gradient Boosting more effective in practice.
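To make the splitting criterion concrete, the short sketch below computes the Gini impurity of a parent node and of a candidate binary split on a toy label vector; the class counts are arbitrary illustrative values rather than counts from the present dataset.
##################################
# Illustrative sketch of the Gini impurity
# criterion for evaluating a candidate split
# (toy label counts, not from this dataset)
##################################
import numpy as np

def gini_impurity(labels):
    # Gini impurity of a node: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def weighted_split_impurity(left, right):
    # Size-weighted Gini impurity of a binary split
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

parent = np.array([0] * 10 + [1] * 6)                      # toy parent node: 10 class-0 and 6 class-1 labels
left, right = np.array([0] * 9 + [1]), np.array([0] + [1] * 5)  # one candidate split of the parent
print(f"Parent impurity: {gini_impurity(parent):.3f}")
print(f"Split impurity:  {weighted_split_impurity(left, right):.3f}")  # lower is better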
- The decision tree model from the sklearn.tree Python library API was implemented.
- The model contains 3 hyperparameters for tuning:
- criterion = function to measure the quality of a split made to vary between gini and entropy
- max_depth = maximum depth of the tree made to vary between 3 and 6
- min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- Hyperparameter tuning was conducted using the 5-cycle 5-fold cross-validation method with optimal model performance using the F1 score determined for:
- criterion = gini
- max_depth = 6
- min_samples_leaf = 5
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.8970
- Precision = 0.7500
- Recall = 0.9836
- F1 Score = 0.8510
- AUROC = 0.9218
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.8550
- Precision = 0.6666
- Recall = 1.0000
- F1 Score = 0.8000
- AUROC = 0.8979
- The apparent and independent validation model performance measures were sufficiently comparable, suggesting the absence of excessive model overfitting.
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features)],
remainder='passthrough',
force_int_remainder_cols=False)
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_dt_pipeline = Pipeline([
('categorical_preprocessor', categorical_preprocessor),
('blended_baselearner_dt_model', DecisionTreeClassifier(class_weight='balanced',
random_state=987654321))
])
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_dt_hyperparameter_grid = {
'blended_baselearner_dt_model__criterion': ['gini', 'entropy'],
'blended_baselearner_dt_model__max_depth': [3, 6],
'blended_baselearner_dt_model__min_samples_leaf': [5, 10]
}
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=987654321)
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_dt_grid_search = GridSearchCV(
estimator=blended_baselearner_dt_pipeline,
param_grid=blended_baselearner_dt_hyperparameter_grid,
scoring='f1',
cv=cv_strategy,
n_jobs=-1,
verbose=1
)
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_dt_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321), estimator=Pipeline(steps=[('categorical_preprocessor', ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough', transformers=[('cat', OrdinalEncoder(), ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response'])])), ('blended_baselearner_dt_model', DecisionTreeClassifier(class_weight='balanced', random_state=987654321))]), n_jobs=-1, param_grid={'blended_baselearner_dt_model__criterion': ['gini', 'entropy'], 'blended_baselearner_dt_model__max_depth': [3, 6], 'blended_baselearner_dt_model__min_samples_leaf': [5, 10]}, scoring='f1', verbose=1)
##################################
# Identifying the best model
##################################
blended_baselearner_dt_optimal = blended_baselearner_dt_grid_search.best_estimator_
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_dt_optimal_f1_cv = blended_baselearner_dt_grid_search.best_score_
blended_baselearner_dt_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train))
blended_baselearner_dt_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation))
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner Decision Trees: ')
print(f"Best Blended Base Learner Decision Trees Hyperparameters: {blended_baselearner_dt_grid_search.best_params_}")
Best Blended Base Learner Decision Trees:
Best Blended Base Learner Decision Trees Hyperparameters: {'blended_baselearner_dt_model__criterion': 'gini', 'blended_baselearner_dt_model__max_depth': 6, 'blended_baselearner_dt_model__min_samples_leaf': 5}
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_dt_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_dt_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8099
F1 Score on Training Data: 0.8511

Classification Report on Train Data:
               precision    recall  f1-score   support
         0.0        0.99      0.86      0.92       143
         1.0        0.75      0.98      0.85        61
    accuracy                            0.90       204
   macro avg        0.87      0.92      0.89       204
weighted avg        0.92      0.90      0.90       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Decision Trees Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Decision Trees Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_dt_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8000

Classification Report on Validation Data:
               precision    recall  f1-score   support
         0.0        1.00      0.80      0.89        49
         1.0        0.67      1.00      0.80        20
    accuracy                            0.86        69
   macro avg        0.83      0.90      0.84        69
weighted avg        0.90      0.86      0.86        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Decision Trees Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Decision Trees Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_baselearner_dt_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train))
blended_baselearner_dt_optimal_train['model'] = ['blended_baselearner_dt_optimal'] * 5
blended_baselearner_dt_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Base Learner Decision Tree Train Performance Metrics: ')
display(blended_baselearner_dt_optimal_train)
Optimal Blended Base Learner Decision Tree Train Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.897059 | blended_baselearner_dt_optimal | train |
1 | Precision | 0.750000 | blended_baselearner_dt_optimal | train |
2 | Recall | 0.983607 | blended_baselearner_dt_optimal | train |
3 | F1 | 0.851064 | blended_baselearner_dt_optimal | train |
4 | AUROC | 0.921873 | blended_baselearner_dt_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_baselearner_dt_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation))
blended_baselearner_dt_optimal_validation['model'] = ['blended_baselearner_dt_optimal'] * 5
blended_baselearner_dt_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Base Learner Decision Tree Validation Performance Metrics: ')
display(blended_baselearner_dt_optimal_validation)
Optimal Blended Base Learner Decision Tree Validation Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.855072 | blended_baselearner_dt_optimal | validation |
1 | Precision | 0.666667 | blended_baselearner_dt_optimal | validation |
2 | Recall | 1.000000 | blended_baselearner_dt_optimal | validation |
3 | F1 | 0.800000 | blended_baselearner_dt_optimal | validation |
4 | AUROC | 0.897959 | blended_baselearner_dt_optimal | validation |
##################################
# Saving the best individual model
# developed from the train data
##################################
joblib.dump(blended_baselearner_dt_optimal,
os.path.join("..", MODELS_PATH, "blended_model_baselearner_decision_trees_optimal.pkl"))
['..\\models\\blended_model_baselearner_decision_trees_optimal.pkl']
1.10.6 Meta Learner - Logistic Regression ¶
Logistic Regression is a linear classification algorithm that estimates the probability of a binary outcome using the logistic (sigmoid) function. It assumes a linear relationship between the predictor variables and the log-odds of the target class. The algorithm involves calculating a weighted sum of input features, applying the sigmoid function to transform the result into a probability, and assigning a class label based on a threshold (typically 0.5). Logistic regression is simple, interpretable, and computationally efficient, making it a popular choice for baseline models and problems where relationships between features and the target variable are approximately linear. It also provides insight into feature importance through its learned coefficients. However, logistic regression has limitations: it struggles with non-linear relationships unless feature engineering or polynomial terms are used, it is sensitive to multicollinearity, and it assumes independence between predictor variables, which may not always hold in real-world data. Additionally, it may perform poorly when classes are highly imbalanced, requiring techniques such as weighting or resampling to improve predictions.
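To make the log-odds-to-probability mapping concrete, the short sketch below applies the sigmoid to a weighted sum of inputs; the coefficients, intercept, and feature values are arbitrary illustrative numbers, not the fitted meta-learner weights.
##################################
# Illustrative sketch of how logistic regression
# turns a weighted sum of features into a class
# probability (arbitrary placeholder values)
##################################
import numpy as np
coefficients = np.array([1.2, -0.8, 0.5])        # one weight per predictor
intercept = -0.3
features = np.array([0.9, 0.4, 0.7])             # e.g., base learner probabilities in a blended setup
log_odds = intercept + features @ coefficients   # linear predictor (log-odds)
probability = 1.0 / (1.0 + np.exp(-log_odds))    # sigmoid maps log-odds onto [0, 1]
predicted_class = int(probability >= 0.5)        # default 0.5 threshold
print(f"log-odds = {log_odds:.3f}, probability = {probability:.3f}, class = {predicted_class}")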
- The logistic regression model from the sklearn.linear_model Python library API was implemented.
- The model contains 3 fixed hyperparameters:
- C = inverse of regularization strength held constant at a value of 1.0
- penalty = penalty norm held constant at a value of l2
- solver = algorithm used in the optimization problem held constant at a value of lbfgs
- A special hyperparameter (class_weight = balanced) was fixed to address the minimal 2:1 class imbalance observed between the No and Yes Recurred categories.
- The apparent model performance of the optimal model is summarized as follows:
- Accuracy = 0.9068
- Precision = 0.8000
- Recall = 0.9180
- F1 Score = 0.8549
- AUROC = 0.9100
- The independent validation model performance of the optimal model is summarized as follows:
- Accuracy = 0.9275
- Precision = 0.8260
- Recall = 0.9500
- F1 Score = 0.8837
- AUROC = 0.9341
- The apparent and independent validation model performance measures were sufficiently comparable, suggesting the absence of excessive model overfitting.
##################################
# Defining the blending strategy (75-25 development-holdout split)
##################################
X_preprocessed_train_development, X_preprocessed_holdout, y_preprocessed_train_development, y_preprocessed_holdout = train_test_split(
X_preprocessed_train, y_preprocessed_train_encoded,
test_size=0.25,
random_state=987654321
)
##################################
# Loading the pre-trained base learners
# from the previously saved pickle files
##################################
blended_baselearners = {}
blended_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
for name in blended_baselearner_model:
    blended_baselearner_model_path = os.path.join("..", MODELS_PATH, f"blended_model_baselearner_{name}_optimal.pkl")
    blended_baselearners[name] = joblib.load(blended_baselearner_model_path)
##################################
# Initializing the meta-feature matrices
##################################
meta_train_blended = np.zeros((X_preprocessed_holdout.shape[0], len(blended_baselearners)))
meta_validation_blended = np.zeros((X_preprocessed_validation.shape[0], len(blended_baselearners)))
##################################
# Generating hold-out predictions for training the meta learner
##################################
for i, (name, model) in enumerate(blended_baselearners.items()):
    # Refitting each base learner on the development split only
    model.fit(X_preprocessed_train_development, y_preprocessed_train_development)
    # Using the positive-class probability as the meta-feature when available, otherwise the hard prediction
    meta_train_blended[:, i] = model.predict_proba(X_preprocessed_holdout)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_holdout)
    meta_validation_blended[:, i] = model.predict_proba(X_preprocessed_validation)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_validation)
##################################
# Training the meta learner on the stacked features
##################################
blended_metalearner_lr_optimal = LogisticRegression(class_weight='balanced',
penalty='l2',
C=1.0,
solver='lbfgs',
random_state=987654321)
blended_metalearner_lr_optimal.fit(meta_train_blended, y_preprocessed_holdout)
LogisticRegression(class_weight='balanced', random_state=987654321)
##################################
# Saving the meta learner model
# developed from the meta-train data
##################################
joblib.dump(blended_metalearner_lr_optimal,
os.path.join("..", MODELS_PATH, "blended_model_metalearner_logistic_regression_optimal.pkl"))
['..\\models\\blended_model_metalearner_logistic_regression_optimal.pkl']
##################################
# Creating a function to extract the
# meta-feature matrices for new data
##################################
def extract_blended_metafeature_matrix(X_preprocessed_new):
    ##################################
    # Loading the pre-trained base learners
    # from the previously saved pickle files
    ##################################
    blended_baselearners = {}
    blended_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
    for name in blended_baselearner_model:
        blended_baselearner_model_path = os.path.join("..", MODELS_PATH, f"blended_model_baselearner_{name}_optimal.pkl")
        blended_baselearners[name] = joblib.load(blended_baselearner_model_path)
    ##################################
    # Initializing the meta-feature matrices
    # for the holdout and new data
    ##################################
    meta_train_blended = np.zeros((X_preprocessed_holdout.shape[0], len(blended_baselearners)))
    meta_new_blended = np.zeros((X_preprocessed_new.shape[0], len(blended_baselearners)))
    ##################################
    # Generating holdout and new-data predictions
    # from the base learners
    ##################################
    for i, (name, model) in enumerate(blended_baselearners.items()):
        # Refitting each base learner on the development split before scoring the new data
        model.fit(X_preprocessed_train_development, y_preprocessed_train_development)
        meta_train_blended[:, i] = model.predict_proba(X_preprocessed_holdout)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_holdout)
        meta_new_blended[:, i] = model.predict_proba(X_preprocessed_new)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_new)
    return meta_new_blended
##################################
# Evaluating the F1 scores
# on the training and validation data
##################################
blended_metalearner_lr_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train)))
blended_metalearner_lr_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation)))
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training data
# to assess overfitting optimism
##################################
print(f"F1 Score on Training Data: {blended_metalearner_lr_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train))))
F1 Score on Training Data: 0.8550

Classification Report on Train Data:
               precision    recall  f1-score   support
         0.0        0.96      0.90      0.93       143
         1.0        0.80      0.92      0.85        61
    accuracy                            0.91       204
   macro avg        0.88      0.91      0.89       204
weighted avg        0.91      0.91      0.91       204
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train)))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validationing Data: {blended_metalearner_lr_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation))))
F1 Score on Validation Data: 0.8837

Classification Report on Validation Data:
               precision    recall  f1-score   support
         0.0        0.98      0.92      0.95        49
         1.0        0.83      0.95      0.88        20
    accuracy                            0.93        69
   macro avg        0.90      0.93      0.92        69
weighted avg        0.93      0.93      0.93        69
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation)))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
blended_metalearner_lr_optimal_train = model_performance_evaluation(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train)))
blended_metalearner_lr_optimal_train['model'] = ['blended_metalearner_lr_optimal'] * 5
blended_metalearner_lr_optimal_train['set'] = ['train'] * 5
print('Optimal Blended Meta Learner Logistic Regression Train Performance Metrics: ')
display(blended_metalearner_lr_optimal_train)
Optimal Blended Meta Learner Logistic Regression Train Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.906863 | blended_metalearner_lr_optimal | train |
1 | Precision | 0.800000 | blended_metalearner_lr_optimal | train |
2 | Recall | 0.918033 | blended_metalearner_lr_optimal | train |
3 | F1 | 0.854962 | blended_metalearner_lr_optimal | train |
4 | AUROC | 0.910065 | blended_metalearner_lr_optimal | train |
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
blended_metalearner_lr_optimal_validation = model_performance_evaluation(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation)))
blended_metalearner_lr_optimal_validation['model'] = ['blended_metalearner_lr_optimal'] * 5
blended_metalearner_lr_optimal_validation['set'] = ['validation'] * 5
print('Optimal Blended Meta Learner Logistic Regression Validation Performance Metrics: ')
display(blended_metalearner_lr_optimal_validation)
Optimal Blended Meta Learner Logistic Regression Validation Performance Metrics:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.927536 | blended_metalearner_lr_optimal | validation |
1 | Precision | 0.826087 | blended_metalearner_lr_optimal | validation |
2 | Recall | 0.950000 | blended_metalearner_lr_optimal | validation |
3 | F1 | 0.883721 | blended_metalearner_lr_optimal | validation |
4 | AUROC | 0.934184 | blended_metalearner_lr_optimal | validation |
1.11. Consolidated Summary¶
- Among the 12 candidate models, the Blended Model, developed by training a Meta Learner on the combined predictions of multiple Base Learners, was selected as the final model, having demonstrated the best F1 score on the independent validation data with minimal overfitting:
- Apparent F1 Score Performance = 0.8549
- Independent Validation F1 Score Performance = 0.8837
- The final model similarly demonstrated a consistently high F1 score on the independent test data:
- Independent Test F1 Score Performance = 0.8571
- The final model configuration is described as follows:
- Base Learner: k-nearest neighbors with optimal hyperparameters:
- n_neighbors = 3
- weights = uniform
- metric = minkowski
- Base Learner: support vector machine with optimal hyperparameters:
- C = 1.0
- kernel = linear
- gamma = scale
- Base Learner: ridge classifier with optimal hyperparameters:
- alpha = 2.0
- solver = saga
- tol = 1e-4
- Base Learner: neural network with optimal hyperparameters:
- hidden_layer_sizes = (50,)
- activation = relu
- alpha = 0.0001
- Base Learner: decision tree with optimal hyperparameters:
- criterion = gini
- max_depth = 6
- min_samples_leaf = 5
- Meta Learner: logistic regression model with optimal hyperparameters:
- C = 1.0
- penalty = l2
- solver = lbfgs
- Only 2 of the 5 base learners demonstrated a significant contribution to the final prediction, as indicated by positive permutation-based importance values (a sketch of how such importances can be estimated is given after this list):
- Base Learner: ridge classifier
- Base Learner: support vector machine
- The remaining 3 base learners did not demonstrate a significant contribution to the final prediction, as indicated by negative permutation-based importance values:
- Base Learner: decision tree
- Base Learner: k-nearest neighbors
- Base Learner: neural network
- For each of the significantly contributing base learners, the predictors with positive permutation-based importance are given as follows:
- Base Learner: ridge classifier
- Age
- T
- Focality
- Smoking
- Response
- Base Learner: support vector machine
- Age
- T
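The permutation-based importance values referenced above can be estimated by shuffling one meta-feature column at a time and measuring the resulting drop in the meta-learner's score. The sketch below is a minimal illustration of this idea, assuming the validation meta-features and an F1 scorer; the exact configuration used to derive the figures reported above may differ.
##################################
# Illustrative sketch (assumed setup) for estimating
# the permutation-based importance of each base learner
# meta-feature for the blended meta learner
##################################
from sklearn.inspection import permutation_importance

blended_baselearner_names = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
# meta_validation_blended and y_preprocessed_validation_encoded are created earlier in this notebook
blended_metalearner_permutation = permutation_importance(blended_metalearner_lr_optimal,
                                                         meta_validation_blended,
                                                         y_preprocessed_validation_encoded,
                                                         scoring='f1',
                                                         n_repeats=30,
                                                         random_state=987654321)
for name, importance in zip(blended_baselearner_names, blended_metalearner_permutation.importances_mean):
    print(f"{name}: {importance:.4f}")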
##################################
# Consolidating all the
# bagged, boosted, stacked and blended
# model performance measures
# for the train and validation data
##################################
ensemble_train_validation_all_performance = pd.concat([bagged_rf_optimal_train,
bagged_rf_optimal_validation,
bagged_et_optimal_train,
bagged_et_optimal_validation,
bagged_bdt_optimal_train,
bagged_bdt_optimal_validation,
bagged_blr_optimal_train,
bagged_blr_optimal_validation,
bagged_bsvm_optimal_train,
bagged_bsvm_optimal_validation,
boosted_ab_optimal_train,
boosted_ab_optimal_validation,
boosted_gb_optimal_train,
boosted_gb_optimal_validation,
boosted_xgb_optimal_train,
boosted_xgb_optimal_validation,
boosted_lgbm_optimal_train,
boosted_lgbm_optimal_validation,
boosted_cb_optimal_train,
boosted_cb_optimal_validation,
stacked_baselearner_knn_optimal_train,
stacked_baselearner_knn_optimal_validation,
stacked_baselearner_svm_optimal_train,
stacked_baselearner_svm_optimal_validation,
stacked_baselearner_rc_optimal_train,
stacked_baselearner_rc_optimal_validation,
stacked_baselearner_nn_optimal_train,
stacked_baselearner_nn_optimal_validation,
stacked_baselearner_dt_optimal_train,
stacked_baselearner_dt_optimal_validation,
stacked_metalearner_lr_optimal_train,
stacked_metalearner_lr_optimal_validation,
blended_baselearner_knn_optimal_train,
blended_baselearner_knn_optimal_validation,
blended_baselearner_svm_optimal_train,
blended_baselearner_svm_optimal_validation,
blended_baselearner_rc_optimal_train,
blended_baselearner_rc_optimal_validation,
blended_baselearner_nn_optimal_train,
blended_baselearner_nn_optimal_validation,
blended_baselearner_dt_optimal_train,
blended_baselearner_dt_optimal_validation,
blended_metalearner_lr_optimal_train,
blended_metalearner_lr_optimal_validation],
ignore_index=True)
print('Consolidated Ensemble Model Performance on Train and Validation Data: ')
display(ensemble_train_validation_all_performance)
Consolidated Ensemble Model Performance on Train and Validation Data:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.892157 | bagged_rf_optimal | train |
1 | Precision | 0.774648 | bagged_rf_optimal | train |
2 | Recall | 0.901639 | bagged_rf_optimal | train |
3 | F1 | 0.833333 | bagged_rf_optimal | train |
4 | AUROC | 0.894876 | bagged_rf_optimal | train |
... | ... | ... | ... | ... |
215 | Accuracy | 0.927536 | blended_metalearner_lr_optimal | validation |
216 | Precision | 0.826087 | blended_metalearner_lr_optimal | validation |
217 | Recall | 0.950000 | blended_metalearner_lr_optimal | validation |
218 | F1 | 0.883721 | blended_metalearner_lr_optimal | validation |
219 | AUROC | 0.934184 | blended_metalearner_lr_optimal | validation |
220 rows × 4 columns
##################################
# Consolidating all the F1 score
# model performance measures
# between the train and validation data
##################################
ensemble_train_validation_all_performance_F1 = ensemble_train_validation_all_performance[ensemble_train_validation_all_performance['metric_name']=='F1']
ensemble_train_validation_all_performance_F1_train = ensemble_train_validation_all_performance_F1[ensemble_train_validation_all_performance_F1['set']=='train'].loc[:,"metric_value"]
ensemble_train_validation_all_performance_F1_validation = ensemble_train_validation_all_performance_F1[ensemble_train_validation_all_performance_F1['set']=='validation'].loc[:,"metric_value"]
##################################
# Combining all the F1 score
# model performance measures
# between the train and validation data
##################################
ensemble_train_validation_all_performance_F1_plot = pd.DataFrame({'train': ensemble_train_validation_all_performance_F1_train.values,
'validation': ensemble_train_validation_all_performance_F1_validation.values},
index=ensemble_train_validation_all_performance_F1['model'].unique())
ensemble_train_validation_all_performance_F1_plot
  | train | validation |
---|---|---|
bagged_rf_optimal | 0.833333 | 0.837209 |
bagged_et_optimal | 0.833333 | 0.837209 |
bagged_bdt_optimal | 0.846154 | 0.857143 |
bagged_blr_optimal | 0.833333 | 0.837209 |
bagged_bsvm_optimal | 0.852713 | 0.857143 |
boosted_ab_optimal | 0.843750 | 0.857143 |
boosted_gb_optimal | 0.910569 | 0.829268 |
boosted_xgb_optimal | 0.850394 | 0.857143 |
boosted_lgbm_optimal | 0.894309 | 0.820513 |
boosted_cb_optimal | 0.843750 | 0.857143 |
stacked_baselearner_knn_optimal | 0.862069 | 0.648649 |
stacked_baselearner_svm_optimal | 0.843750 | 0.857143 |
stacked_baselearner_rc_optimal | 0.827068 | 0.837209 |
stacked_baselearner_nn_optimal | 0.822581 | 0.809524 |
stacked_baselearner_dt_optimal | 0.851064 | 0.800000 |
stacked_metalearner_lr_optimal | 0.852713 | 0.857143 |
blended_baselearner_knn_optimal | 0.862069 | 0.648649 |
blended_baselearner_svm_optimal | 0.843750 | 0.857143 |
blended_baselearner_rc_optimal | 0.827068 | 0.837209 |
blended_baselearner_nn_optimal | 0.822581 | 0.809524 |
blended_baselearner_dt_optimal | 0.851064 | 0.800000 |
blended_metalearner_lr_optimal | 0.854962 | 0.883721 |
##################################
# Plotting all the F1 score
# model performance measures
# between the train and validation sets
##################################
ensemble_train_validation_all_performance_F1_plot = ensemble_train_validation_all_performance_F1_plot.plot.barh(figsize=(10, 20), width=0.9)
ensemble_train_validation_all_performance_F1_plot.set_xlim(0.00,1.00)
ensemble_train_validation_all_performance_F1_plot.set_title("Model Comparison by F1 Score Performance on Train and Validation Data")
ensemble_train_validation_all_performance_F1_plot.set_xlabel("F1 Score Performance")
ensemble_train_validation_all_performance_F1_plot.set_ylabel("Ensemble Model")
ensemble_train_validation_all_performance_F1_plot.grid(False)
ensemble_train_validation_all_performance_F1_plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in ensemble_train_validation_all_performance_F1_plot.containers:
    ensemble_train_validation_all_performance_F1_plot.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
##################################
# Consolidating all the final
# bagged, boosted, stacked and blended
# model performance measures
# for the train and validation data
##################################
ensemble_train_validation_performance = ensemble_train_validation_all_performance[
~ensemble_train_validation_all_performance['model'].str.contains('baselearner', case=False, na=False)
]
print('Consolidated Final Ensemble Model Performance on Train and Validation Data: ')
display(ensemble_train_validation_performance)
Consolidated Final Ensemble Model Performance on Train and Validation Data:
  | metric_name | metric_value | model | set |
---|---|---|---|---|
0 | Accuracy | 0.892157 | bagged_rf_optimal | train |
1 | Precision | 0.774648 | bagged_rf_optimal | train |
2 | Recall | 0.901639 | bagged_rf_optimal | train |
3 | F1 | 0.833333 | bagged_rf_optimal | train |
4 | AUROC | 0.894876 | bagged_rf_optimal | train |
... | ... | ... | ... | ... |
215 | Accuracy | 0.927536 | blended_metalearner_lr_optimal | validation |
216 | Precision | 0.826087 | blended_metalearner_lr_optimal | validation |
217 | Recall | 0.950000 | blended_metalearner_lr_optimal | validation |
218 | F1 | 0.883721 | blended_metalearner_lr_optimal | validation |
219 | AUROC | 0.934184 | blended_metalearner_lr_optimal | validation |
120 rows × 4 columns
##################################
# Consolidating all the F1 score
# model performance measures
# between the train and validation data
##################################
ensemble_train_validation_performance_F1 = ensemble_train_validation_performance[ensemble_train_validation_performance['metric_name']=='F1']
ensemble_train_validation_performance_F1_train = ensemble_train_validation_performance_F1[ensemble_train_validation_performance_F1['set']=='train'].loc[:,"metric_value"]
ensemble_train_validation_performance_F1_validation = ensemble_train_validation_performance_F1[ensemble_train_validation_performance_F1['set']=='validation'].loc[:,"metric_value"]
##################################
# Combining all the F1 score
# model performance measures
# between the train and validation data
##################################
ensemble_train_validation_performance_F1_plot = pd.DataFrame({'train': ensemble_train_validation_performance_F1_train.values,
'validation': ensemble_train_validation_performance_F1_validation.values},
index=ensemble_train_validation_performance_F1['model'].unique())
ensemble_train_validation_performance_F1_plot
  | train | validation |
---|---|---|
bagged_rf_optimal | 0.833333 | 0.837209 |
bagged_et_optimal | 0.833333 | 0.837209 |
bagged_bdt_optimal | 0.846154 | 0.857143 |
bagged_blr_optimal | 0.833333 | 0.837209 |
bagged_bsvm_optimal | 0.852713 | 0.857143 |
boosted_ab_optimal | 0.843750 | 0.857143 |
boosted_gb_optimal | 0.910569 | 0.829268 |
boosted_xgb_optimal | 0.850394 | 0.857143 |
boosted_lgbm_optimal | 0.894309 | 0.820513 |
boosted_cb_optimal | 0.843750 | 0.857143 |
stacked_metalearner_lr_optimal | 0.852713 | 0.857143 |
blended_metalearner_lr_optimal | 0.854962 | 0.883721 |
##################################
# Plotting all the F1 score
# model performance measures
# between the train and validation sets
##################################
ensemble_train_validation_performance_F1_plot = ensemble_train_validation_performance_F1_plot.plot.barh(figsize=(10, 10), width=0.9)
ensemble_train_validation_performance_F1_plot.set_xlim(0.00,1.00)
ensemble_train_validation_performance_F1_plot.set_title("Model Comparison by F1 Score Performance on Train and Validation Data")
ensemble_train_validation_performance_F1_plot.set_xlabel("F1 Score Performance")
ensemble_train_validation_performance_F1_plot.set_ylabel("Ensemble Model")
ensemble_train_validation_performance_F1_plot.grid(False)
ensemble_train_validation_performance_F1_plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in ensemble_train_validation_performance_F1_plot.containers:
    ensemble_train_validation_performance_F1_plot.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
##################################
# Gathering all model performance measures
# for the validation data
##################################
ensemble_train_validation_performance_Accuracy_validation = ensemble_train_validation_performance[(ensemble_train_validation_performance['set']=='validation') & (ensemble_train_validation_performance['metric_name']=='Accuracy')].loc[:,"metric_value"]
ensemble_train_validation_performance_Precision_validation = ensemble_train_validation_performance[(ensemble_train_validation_performance['set']=='validation') & (ensemble_train_validation_performance['metric_name']=='Precision')].loc[:,"metric_value"]
ensemble_train_validation_performance_Recall_validation = ensemble_train_validation_performance[(ensemble_train_validation_performance['set']=='validation') & (ensemble_train_validation_performance['metric_name']=='Recall')].loc[:,"metric_value"]
ensemble_train_validation_performance_F1_validation = ensemble_train_validation_performance[(ensemble_train_validation_performance['set']=='validation') & (ensemble_train_validation_performance['metric_name']=='F1')].loc[:,"metric_value"]
ensemble_train_validation_performance_AUROC_validation = ensemble_train_validation_performance[(ensemble_train_validation_performance['set']=='validation') & (ensemble_train_validation_performance['metric_name']=='AUROC')].loc[:,"metric_value"]
##################################
# Combining all the model performance measures
# for the validation data
##################################
ensemble_train_validation_performance_all_plot_validation = pd.DataFrame({'accuracy': ensemble_train_validation_performance_Accuracy_validation.values,
'precision': ensemble_train_validation_performance_Precision_validation.values,
'recall': ensemble_train_validation_performance_Recall_validation.values,
'f1': ensemble_train_validation_performance_F1_validation.values,
'auroc': ensemble_train_validation_performance_AUROC_validation.values},
index=ensemble_train_validation_performance['model'].unique())
ensemble_train_validation_performance_all_plot_validation
  | accuracy | precision | recall | f1 | auroc |
---|---|---|---|---|---|
bagged_rf_optimal | 0.898551 | 0.782609 | 0.90 | 0.837209 | 0.898980 |
bagged_et_optimal | 0.898551 | 0.782609 | 0.90 | 0.837209 | 0.898980 |
bagged_bdt_optimal | 0.913043 | 0.818182 | 0.90 | 0.857143 | 0.909184 |
bagged_blr_optimal | 0.898551 | 0.782609 | 0.90 | 0.837209 | 0.898980 |
bagged_bsvm_optimal | 0.913043 | 0.818182 | 0.90 | 0.857143 | 0.909184 |
boosted_ab_optimal | 0.913043 | 0.818182 | 0.90 | 0.857143 | 0.909184 |
boosted_gb_optimal | 0.898551 | 0.809524 | 0.85 | 0.829268 | 0.884184 |
boosted_xgb_optimal | 0.913043 | 0.818182 | 0.90 | 0.857143 | 0.909184 |
boosted_lgbm_optimal | 0.898551 | 0.842105 | 0.80 | 0.820513 | 0.869388 |
boosted_cb_optimal | 0.913043 | 0.818182 | 0.90 | 0.857143 | 0.909184 |
stacked_metalearner_lr_optimal | 0.913043 | 0.818182 | 0.90 | 0.857143 | 0.909184 |
blended_metalearner_lr_optimal | 0.927536 | 0.826087 | 0.95 | 0.883721 | 0.934184 |
##################################
# Gathering the model evaluation metrics
# for the test data
##################################
##################################
# Defining a dictionary of models and
# their corresponding feature extraction functions
##################################
models = {
'bagged_rf_optimal': bagged_rf_optimal,
'bagged_et_optimal': bagged_et_optimal,
'bagged_bdt_optimal': bagged_bdt_optimal,
'bagged_blr_optimal': bagged_blr_optimal,
'bagged_bsvm_optimal': bagged_bsvm_optimal,
'boosted_ab_optimal': boosted_ab_optimal,
'boosted_gb_optimal': boosted_gb_optimal,
'boosted_xgb_optimal': boosted_xgb_optimal,
'boosted_lgbm_optimal': boosted_lgbm_optimal,
'boosted_cb_optimal': boosted_cb_optimal,
'stacked_baselearner_knn_optimal': stacked_baselearner_knn_optimal,
'stacked_baselearner_svm_optimal': stacked_baselearner_svm_optimal,
'stacked_baselearner_rc_optimal': stacked_baselearner_rc_optimal,
'stacked_baselearner_nn_optimal': stacked_baselearner_nn_optimal,
'stacked_baselearner_dt_optimal': stacked_baselearner_dt_optimal,
'stacked_metalearner_lr_optimal': stacked_metalearner_lr_optimal,
'blended_baselearner_knn_optimal': blended_baselearner_knn_optimal,
'blended_baselearner_svm_optimal': blended_baselearner_svm_optimal,
'blended_baselearner_rc_optimal': blended_baselearner_rc_optimal,
'blended_baselearner_nn_optimal': blended_baselearner_nn_optimal,
'blended_baselearner_dt_optimal': blended_baselearner_dt_optimal,
'blended_metalearner_lr_optimal': blended_metalearner_lr_optimal
}
##################################
# Defining transformation functions for meta-learners
##################################
feature_extractors = {
'stacked_metalearner_lr_optimal': extract_stacked_metafeature_matrix,
'blended_metalearner_lr_optimal': extract_blended_metafeature_matrix
}
##################################
# Encoding the response variables
# for the test data
##################################
y_preprocessed_test_encoded = y_encoder.transform(y_preprocessed_test.values.reshape(-1, 1)).ravel()
##################################
# Storing the model evaluation metrics
# for the test data
##################################
ensemble_test_all_performance = []
##################################
# Looping through each model
# and evaluate performance on test data
##################################
for model_name, model in models.items():
    # Applying transformation if needed (for meta-learners only; base and ensemble models use the test features directly)
    X_input = feature_extractors.get(model_name, lambda x: x)(X_preprocessed_test)
    # Evaluating performance
    ensemble_test_all_performance_results = model_performance_evaluation(y_preprocessed_test_encoded, model.predict(X_input))
    # Adding metadata columns
    ensemble_test_all_performance_results['model'] = model_name
    ensemble_test_all_performance_results['set'] = 'test'
    # Storing result
    ensemble_test_all_performance.append(ensemble_test_all_performance_results)
##################################
# Consolidating all model performance measures
# for the test data
##################################
ensemble_test_all_performance = pd.concat(ensemble_test_all_performance, ignore_index=True)
print('Consolidated Ensemble Model Performance on Test Data: ')
display(ensemble_test_all_performance)
Consolidated Ensemble Model Performance on Test Data:
metric_name | metric_value | model | set | |
---|---|---|---|---|
0 | Accuracy | 0.901099 | bagged_rf_optimal | test |
1 | Precision | 0.821429 | bagged_rf_optimal | test |
2 | Recall | 0.851852 | bagged_rf_optimal | test |
3 | F1 | 0.836364 | bagged_rf_optimal | test |
4 | AUROC | 0.886863 | bagged_rf_optimal | test |
... | ... | ... | ... | ... |
105 | Accuracy | 0.912088 | blended_metalearner_lr_optimal | test |
106 | Precision | 0.827586 | blended_metalearner_lr_optimal | test |
107 | Recall | 0.888889 | blended_metalearner_lr_optimal | test |
108 | F1 | 0.857143 | blended_metalearner_lr_optimal | test |
109 | AUROC | 0.905382 | blended_metalearner_lr_optimal | test |
110 rows × 4 columns
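For reference, the model_performance_evaluation helper called inside the loop above is defined earlier in the document. Judging from the metric_name / metric_value layout of its output, it likely resembles the sketch below; the name model_performance_evaluation_sketch and the exact metric settings are assumptions.
##################################
# Illustrative sketch (assumption) of the metric computation
# performed by the model_performance_evaluation helper
##################################
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
def model_performance_evaluation_sketch(y_true, y_pred):
    # Returns one row per metric in the metric_name / metric_value format shown above
    metrics = {'Accuracy': accuracy_score(y_true, y_pred),
               'Precision': precision_score(y_true, y_pred),
               'Recall': recall_score(y_true, y_pred),
               'F1': f1_score(y_true, y_pred),
               'AUROC': roc_auc_score(y_true, y_pred)}
    return pd.DataFrame({'metric_name': list(metrics.keys()),
                         'metric_value': list(metrics.values())})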
##################################
# Consolidating all the final
# bagged, boosted, stacked and blended
# model performance measures
# for the test data
##################################
ensemble_test_performance = ensemble_test_all_performance[
~ensemble_test_all_performance['model'].str.contains('baselearner', case=False, na=False)
]
print('Consolidated Final Ensemble Model Performance on Test Data: ')
display(ensemble_test_performance)
Consolidated Final Ensemble Model Performance on Test Data:
metric_name | metric_value | model | set | |
---|---|---|---|---|
0 | Accuracy | 0.901099 | bagged_rf_optimal | test |
1 | Precision | 0.821429 | bagged_rf_optimal | test |
2 | Recall | 0.851852 | bagged_rf_optimal | test |
3 | F1 | 0.836364 | bagged_rf_optimal | test |
4 | AUROC | 0.886863 | bagged_rf_optimal | test |
5 | Accuracy | 0.912088 | bagged_et_optimal | test |
6 | Precision | 0.851852 | bagged_et_optimal | test |
7 | Recall | 0.851852 | bagged_et_optimal | test |
8 | F1 | 0.851852 | bagged_et_optimal | test |
9 | AUROC | 0.894676 | bagged_et_optimal | test |
10 | Accuracy | 0.912088 | bagged_bdt_optimal | test |
11 | Precision | 0.851852 | bagged_bdt_optimal | test |
12 | Recall | 0.851852 | bagged_bdt_optimal | test |
13 | F1 | 0.851852 | bagged_bdt_optimal | test |
14 | AUROC | 0.894676 | bagged_bdt_optimal | test |
15 | Accuracy | 0.901099 | bagged_blr_optimal | test |
16 | Precision | 0.800000 | bagged_blr_optimal | test |
17 | Recall | 0.888889 | bagged_blr_optimal | test |
18 | F1 | 0.842105 | bagged_blr_optimal | test |
19 | AUROC | 0.897569 | bagged_blr_optimal | test |
20 | Accuracy | 0.912088 | bagged_bsvm_optimal | test |
21 | Precision | 0.827586 | bagged_bsvm_optimal | test |
22 | Recall | 0.888889 | bagged_bsvm_optimal | test |
23 | F1 | 0.857143 | bagged_bsvm_optimal | test |
24 | AUROC | 0.905382 | bagged_bsvm_optimal | test |
25 | Accuracy | 0.912088 | boosted_ab_optimal | test |
26 | Precision | 0.851852 | boosted_ab_optimal | test |
27 | Recall | 0.851852 | boosted_ab_optimal | test |
28 | F1 | 0.851852 | boosted_ab_optimal | test |
29 | AUROC | 0.894676 | boosted_ab_optimal | test |
30 | Accuracy | 0.923077 | boosted_gb_optimal | test |
31 | Precision | 0.884615 | boosted_gb_optimal | test |
32 | Recall | 0.851852 | boosted_gb_optimal | test |
33 | F1 | 0.867925 | boosted_gb_optimal | test |
34 | AUROC | 0.902488 | boosted_gb_optimal | test |
35 | Accuracy | 0.901099 | boosted_xgb_optimal | test |
36 | Precision | 0.846154 | boosted_xgb_optimal | test |
37 | Recall | 0.814815 | boosted_xgb_optimal | test |
38 | F1 | 0.830189 | boosted_xgb_optimal | test |
39 | AUROC | 0.876157 | boosted_xgb_optimal | test |
40 | Accuracy | 0.912088 | boosted_lgbm_optimal | test |
41 | Precision | 0.880000 | boosted_lgbm_optimal | test |
42 | Recall | 0.814815 | boosted_lgbm_optimal | test |
43 | F1 | 0.846154 | boosted_lgbm_optimal | test |
44 | AUROC | 0.883970 | boosted_lgbm_optimal | test |
45 | Accuracy | 0.912088 | boosted_cb_optimal | test |
46 | Precision | 0.851852 | boosted_cb_optimal | test |
47 | Recall | 0.851852 | boosted_cb_optimal | test |
48 | F1 | 0.851852 | boosted_cb_optimal | test |
49 | AUROC | 0.894676 | boosted_cb_optimal | test |
75 | Accuracy | 0.923077 | stacked_metalearner_lr_optimal | test |
76 | Precision | 0.857143 | stacked_metalearner_lr_optimal | test |
77 | Recall | 0.888889 | stacked_metalearner_lr_optimal | test |
78 | F1 | 0.872727 | stacked_metalearner_lr_optimal | test |
79 | AUROC | 0.913194 | stacked_metalearner_lr_optimal | test |
105 | Accuracy | 0.912088 | blended_metalearner_lr_optimal | test |
106 | Precision | 0.827586 | blended_metalearner_lr_optimal | test |
107 | Recall | 0.888889 | blended_metalearner_lr_optimal | test |
108 | F1 | 0.857143 | blended_metalearner_lr_optimal | test |
109 | AUROC | 0.905382 | blended_metalearner_lr_optimal | test |
##################################
# Gathering all model performance measures
# for the test data
##################################
ensemble_test_performance_Accuracy_test = ensemble_test_performance[(ensemble_test_performance['set']=='test') & (ensemble_test_performance['metric_name']=='Accuracy')].loc[:,"metric_value"]
ensemble_test_performance_Precision_test = ensemble_test_performance[(ensemble_test_performance['set']=='test') & (ensemble_test_performance['metric_name']=='Precision')].loc[:,"metric_value"]
ensemble_test_performance_Recall_test = ensemble_test_performance[(ensemble_test_performance['set']=='test') & (ensemble_test_performance['metric_name']=='Recall')].loc[:,"metric_value"]
ensemble_test_performance_F1_test = ensemble_test_performance[(ensemble_test_performance['set']=='test') & (ensemble_test_performance['metric_name']=='F1')].loc[:,"metric_value"]
ensemble_test_performance_AUROC_test = ensemble_test_performance[(ensemble_test_performance['set']=='test') & (ensemble_test_performance['metric_name']=='AUROC')].loc[:,"metric_value"]
##################################
# Combining all the model performance measures
# for the test data
##################################
ensemble_test_performance_all_plot_test = pd.DataFrame({'accuracy': ensemble_test_performance_Accuracy_test.values,
'precision': ensemble_test_performance_Precision_test.values,
'recall': ensemble_test_performance_Recall_test.values,
'f1': ensemble_test_performance_F1_test.values,
'auroc': ensemble_test_performance_AUROC_test.values},
index=ensemble_test_performance['model'].unique())
ensemble_test_performance_all_plot_test
accuracy | precision | recall | f1 | auroc | |
---|---|---|---|---|---|
bagged_rf_optimal | 0.901099 | 0.821429 | 0.851852 | 0.836364 | 0.886863 |
bagged_et_optimal | 0.912088 | 0.851852 | 0.851852 | 0.851852 | 0.894676 |
bagged_bdt_optimal | 0.912088 | 0.851852 | 0.851852 | 0.851852 | 0.894676 |
bagged_blr_optimal | 0.901099 | 0.800000 | 0.888889 | 0.842105 | 0.897569 |
bagged_bsvm_optimal | 0.912088 | 0.827586 | 0.888889 | 0.857143 | 0.905382 |
boosted_ab_optimal | 0.912088 | 0.851852 | 0.851852 | 0.851852 | 0.894676 |
boosted_gb_optimal | 0.923077 | 0.884615 | 0.851852 | 0.867925 | 0.902488 |
boosted_xgb_optimal | 0.901099 | 0.846154 | 0.814815 | 0.830189 | 0.876157 |
boosted_lgbm_optimal | 0.912088 | 0.880000 | 0.814815 | 0.846154 | 0.883970 |
boosted_cb_optimal | 0.912088 | 0.851852 | 0.851852 | 0.851852 | 0.894676 |
stacked_metalearner_lr_optimal | 0.923077 | 0.857143 | 0.888889 | 0.872727 | 0.913194 |
blended_metalearner_lr_optimal | 0.912088 | 0.827586 | 0.888889 | 0.857143 | 0.905382 |
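As an optional follow-up not part of the original pipeline, the comparison table above can be ranked by F1 score to surface the strongest test-set performers; the snippet below is a minimal sketch using the DataFrame just displayed.
##################################
# Optional step (sketch): ranking the ensemble models
# by F1 score on the test data
##################################
display(ensemble_test_performance_all_plot_test.sort_values(by='f1', ascending=False))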
##################################
# Consolidating all the final
# bagged, boosted, stacked and blended
# model performance measures
# for the train, validation and test data
##################################
ensemble_overall_performance = pd.concat([ensemble_train_validation_performance, ensemble_test_performance], axis=0)
##################################
# Consolidating all the F1 score
# model performance measures
# between the train, validation and test data
##################################
ensemble_overall_performance_F1 = ensemble_overall_performance[ensemble_overall_performance['metric_name']=='F1']
ensemble_overall_performance_F1_train = ensemble_overall_performance_F1[ensemble_overall_performance_F1['set']=='train'].loc[:,"metric_value"]
ensemble_overall_performance_F1_validation = ensemble_overall_performance_F1[ensemble_overall_performance_F1['set']=='validation'].loc[:,"metric_value"]
ensemble_overall_performance_F1_test = ensemble_overall_performance_F1[ensemble_overall_performance_F1['set']=='test'].loc[:,"metric_value"]
##################################
# Combining all the F1 score
# model performance measures
# between the train and validation data
##################################
ensemble_overall_performance_F1_plot = pd.DataFrame({'train': ensemble_overall_performance_F1_train.values,
'validation': ensemble_overall_performance_F1_validation.values,
'test': ensemble_overall_performance_F1_test.values},
index=ensemble_overall_performance_F1['model'].unique())
ensemble_overall_performance_F1_plot
train | validation | test | |
---|---|---|---|
bagged_rf_optimal | 0.833333 | 0.837209 | 0.836364 |
bagged_et_optimal | 0.833333 | 0.837209 | 0.851852 |
bagged_bdt_optimal | 0.846154 | 0.857143 | 0.851852 |
bagged_blr_optimal | 0.833333 | 0.837209 | 0.842105 |
bagged_bsvm_optimal | 0.852713 | 0.857143 | 0.857143 |
boosted_ab_optimal | 0.843750 | 0.857143 | 0.851852 |
boosted_gb_optimal | 0.910569 | 0.829268 | 0.867925 |
boosted_xgb_optimal | 0.850394 | 0.857143 | 0.830189 |
boosted_lgbm_optimal | 0.894309 | 0.820513 | 0.846154 |
boosted_cb_optimal | 0.843750 | 0.857143 | 0.851852 |
stacked_metalearner_lr_optimal | 0.852713 | 0.857143 | 0.872727 |
blended_metalearner_lr_optimal | 0.854962 | 0.883721 | 0.857143 |
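As a quick overfitting check not part of the original pipeline, the train-to-test F1 gap per model can be computed directly from the consolidated table above, before the variable is reused for plotting in the next cell.
##################################
# Optional check (sketch): train-to-test F1 generalization gap
# computed from the consolidated F1 table above
##################################
f1_generalization_gap = (ensemble_overall_performance_F1_plot['train'] - ensemble_overall_performance_F1_plot['test']).sort_values(ascending=False)
display(f1_generalization_gap)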
##################################
# Plotting all the F1 score
# model performance measures
# between train, validation and test sets
##################################
ensemble_overall_performance_F1_plot = ensemble_overall_performance_F1_plot.plot.barh(figsize=(10, 10), width=0.9)
ensemble_overall_performance_F1_plot.set_xlim(0.00,1.00)
ensemble_overall_performance_F1_plot.set_title("Model Comparison by F1 Score Performance on Train, Validation and Test Data")
ensemble_overall_performance_F1_plot.set_xlabel("F1 Score Performance")
ensemble_overall_performance_F1_plot.set_ylabel("Ensemble Model")
ensemble_overall_performance_F1_plot.grid(False)
ensemble_overall_performance_F1_plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in ensemble_overall_performance_F1_plot.containers:
ensemble_overall_performance_F1_plot.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
##################################
# Computing the permutation importance
# for the final model determined as the blended model
# with a Logistic Regression meta learner comprised of the
# KNN, SVM, Ridge Classifier, Neural Network and Decision Tree base learners
##################################
base_learner_names = ['KNN', 'SVM', 'Ridge Classifier', 'Neural Network', 'Decision Tree']
perm_importance = permutation_importance(
blended_metalearner_lr_optimal, # Meta Learner
meta_validation_blended, # Meta Features (Base Learner Predictions)
y_preprocessed_validation_encoded, # True Labels
n_repeats=10,
random_state=42
)
# Obtaining the sorted indices in descending order
sorted_idx = perm_importance.importances_mean.argsort()[::-1]
# Plotting the feature importance
plt.figure(figsize=(17, 5))
plt.bar(range(len(perm_importance.importances_mean)), perm_importance.importances_mean[sorted_idx], align='center')
plt.xticks(range(len(perm_importance.importances_mean)), np.array(base_learner_names)[sorted_idx], rotation=90)
plt.xlabel("Base Learner")
plt.ylabel("Permutation Importance Score")
plt.title("Permutation Importance: Blended Model (Meta Learner: Logistic Regression, Base Learners: KNN, SVM, Ridge Classifier, Neural Network, Decision Tree)")
plt.show()
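As a supplementary view of the same result object, the permutation importance means and standard deviations can also be tabulated per base learner; this is a minimal sketch assuming the perm_importance result computed above.
##################################
# Optional step (sketch): tabulating the permutation importance
# mean and standard deviation per base learner
##################################
perm_importance_summary = pd.DataFrame({'base_learner': base_learner_names,
                                        'importance_mean': perm_importance.importances_mean,
                                        'importance_std': perm_importance.importances_std}).sort_values(by='importance_mean', ascending=False)
display(perm_importance_summary)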
##################################
# Creating a function to compute the permutation importance
# for the KNN, SVM, Ridge Classifier, Neural Network and Decision Tree base learners
##################################
feature_names = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response','Age']
def compute_permutation_importance(model, X_evaluation, y_evaluation, model_name="Model", feature_names=feature_names, n_repeats=10, random_state=42):
# Computing permutation importance
perm_importance = permutation_importance(model, X_evaluation, y_evaluation, n_repeats=n_repeats, random_state=random_state)
# Getting the sorted indices (descending order)
sorted_idx = perm_importance.importances_mean.argsort()[::-1]
# Using feature names if provided, else using column indices
if feature_names is None:
feature_names = [f"Feature {i}" for i in range(X_evaluation.shape[1])]
# Plotting feature importance
plt.figure(figsize=(17, 5))
plt.bar(range(len(perm_importance.importances_mean)), perm_importance.importances_mean[sorted_idx], align='center')
plt.xticks(range(len(perm_importance.importances_mean)), np.array(feature_names)[sorted_idx], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Permutation Importance Score")
plt.title(f"Feature Importance (Permutation): {model_name}")
plt.show()
return perm_importance
##################################
# Computing the permutation importance
# for the Ridge Classifier base learner
##################################
perm_importance_blended_baselearner_rc_optimal = compute_permutation_importance(blended_baselearner_rc_optimal,
X_preprocessed_train,
y_preprocessed_train_encoded,
"Optimal Blended Base Learner Ridge Classifier",
feature_names=feature_names)
##################################
# Computing the permutation importance
# for the Support Vector Machine base learner
##################################
perm_importance_blended_baselearner_svm_optimal = compute_permutation_importance(blended_baselearner_svm_optimal,
X_preprocessed_train,
y_preprocessed_train_encoded,
"Optimal Blended Base Learner SVM",
feature_names=feature_names)
##################################
# Computing the permutation importance
# for the Decision Tree base learner
##################################
perm_importance_blended_baselearner_dt_optimal = compute_permutation_importance(blended_baselearner_dt_optimal,
X_preprocessed_train,
y_preprocessed_train_encoded,
"Optimal Blended Base Learner Decision Tree",
feature_names=feature_names)
##################################
# Computing the permutation importance
# for the KNN base learner
##################################
perm_importance_blended_baselearner_knn_optimal = compute_permutation_importance(blended_baselearner_knn_optimal,
X_preprocessed_train,
y_preprocessed_train_encoded,
"Optimal Blended Base Learner KNN",
feature_names=feature_names)
##################################
# Computing the permutation importance
# for the Neural Network base learner
##################################
perm_importance_blended_baselearner_nn_optimal = compute_permutation_importance(blended_baselearner_nn_optimal,
X_preprocessed_train,
y_preprocessed_train_encoded,
"Optimal Blended Base Learner Neural Network",
feature_names=feature_names)
2. Summary ¶
3. References ¶
- [Book] Ensemble Methods for Machine Learning by Gautam Kunapuli
- [Book] Applied Predictive Modeling by Max Kuhn and Kjell Johnson
- [Book] An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani
- [Book] Ensemble Methods: Foundations and Algorithms by Zhi-Hua Zhou
- [Book] Effective XGBoost: Optimizing, Tuning, Understanding, and Deploying Classification Models (Treading on Python) by Matt Harrison, Edward Krueger, Alex Rook, Ronald Legere and Bojan Tunguz
- [Python Library API] NumPy by NumPy Team
- [Python Library API] pandas by Pandas Team
- [Python Library API] seaborn by Seaborn Team
- [Python Library API] matplotlib.pyplot by MatPlotLib Team
- [Python Library API] matplotlib.image by MatPlotLib Team
- [Python Library API] matplotlib.offsetbox by MatPlotLib Team
- [Python Library API] itertools by Python Team
- [Python Library API] operator by Python Team
- [Python Library API] sklearn.experimental by Scikit-Learn Team
- [Python Library API] sklearn.impute by Scikit-Learn Team
- [Python Library API] sklearn.linear_model by Scikit-Learn Team
- [Python Library API] sklearn.preprocessing by Scikit-Learn Team
- [Python Library API] scipy by SciPy Team
- [Python Library API] sklearn.tree by Scikit-Learn Team
- [Python Library API] sklearn.ensemble by Scikit-Learn Team
- [Python Library API] sklearn.svm by Scikit-Learn Team
- [Python Library API] sklearn.metrics by Scikit-Learn Team
- [Python Library API] sklearn.neighbors by Scikit-Learn Team
- [Python Library API] sklearn.neural_network by Scikit-Learn Team
- [Python Library API] xgboost by XGBoost Team
- [Python Library API] lightgbm by LightGBM Team
- [Python Library API] catboost by CatBoost Team
- [Python Library API] imblearn.over_sampling by Imbalanced-Learn Team
- [Python Library API] imblearn.under_sampling by Imbalanced-Learn Team
- [Python Library API] StatsModels by StatsModels Team
- [Article] Ensemble: Boosting, Bagging, and Stacking Machine Learning by Jason Brownlee (MachineLearningMastery.Com)
- [Article] Stacking Machine Learning: Everything You Need to Know by Ada Parker (MachineLearningPro.Org)
- [Article] Ensemble Learning: Bagging, Boosting and Stacking by Edouard Duchesnay, Tommy Lofstedt and Feki Younes (Duchesnay.GitHub.IO)
- [Article] Stack Machine Learning Models: Get Better Results by Casper Hansen (Developer.IBM.Com)
- [Article] GradientBoosting vs AdaBoost vs XGBoost vs CatBoost vs LightGBM by Geeks for Geeks Team (GeeksForGeeks.Org)
- [Article] A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning by Jason Brownlee (MachineLearningMastery.Com)
- [Article] The Ultimate Guide to AdaBoost Algorithm | What is AdaBoost Algorithm? by Ashish Kumar (MyGreatLearning.Com)
- [Article] A Gentle Introduction to Ensemble Learning Algorithms by Jason Brownlee (MachineLearningMastery.Com)
- [Article] Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results by Necati Demir (Toptal.Com)
- [Article] The Essential Guide to Ensemble Learning by Rohit Kundu (V7Labs.Com)
- [Article] Develop an Intuition for How Ensemble Learning Works by Jason Brownlee (MachineLearningMastery.Com)
- [Article] Mastering Ensemble Techniques in Machine Learning: Bagging, Boosting, Bayes Optimal Classifier, and Stacking by Rahul Jain (Medium)
- [Article] Ensemble Learning: Bagging, Boosting, Stacking by Ayşe Kübra Kuyucu (Medium)
- [Article] Ensemble: Boosting, Bagging, and Stacking Machine Learning by Aleyna Şenozan (Medium)
- [Article] Boosting, Stacking, and Bagging for Ensemble Models for Time Series Analysis with Python by Kyle Jones (Medium)
- [Article] Different types of Ensemble Techniques — Bagging, Boosting, Stacking, Voting, Blending by Abhishek Jain (Medium)
- [Article] Understanding Ensemble Methods: Bagging, Boosting, and Stacking by Divya bhagat (Medium)
- [Video Tutorial] BAGGING vs. BOOSTING vs STACKING in Ensemble Learning | Machine Learning by Gate Smashers (YouTube)
- [Video Tutorial] What is Ensemble Method in Machine Learning | Bagging | Boosting | Stacking | Voting by Data_SPILL (YouTube)
- [Video Tutorial] Ensemble Methods | Bagging | Boosting | Stacking by World of Signet (YouTube)
- [Video Tutorial] Ensemble (Boosting, Bagging, and Stacking) in Machine Learning: Easy Explanation for Data Scientists by Emma Ding (YouTube)
- [Video Tutorial] Ensemble Learning - Bagging, Boosting, and Stacking explained in 4 minutes! by Melissa Van Bussel (YouTube)
- [Video Tutorial] Introduction to Ensemble Learning | Bagging , Boosting & Stacking Techniques by UncomplicatingTech (YouTube)
- [Video Tutorial] Machine Learning Basics: Ensemble Learning: Bagging, Boosting, Stacking by ISSAI_NU (YouTube)
- [Course] DataCamp Python Data Analyst Certificate by DataCamp Team (DataCamp)
- [Course] DataCamp Python Associate Data Scientist Certificate by DataCamp Team (DataCamp)
- [Course] DataCamp Python Data Scientist Certificate by DataCamp Team (DataCamp)
- [Course] DataCamp Machine Learning Engineer Certificate by DataCamp Team (DataCamp)
- [Course] DataCamp Machine Learning Scientist Certificate by DataCamp Team (DataCamp)
- [Course] IBM Data Analyst Professional Certificate by IBM Team (Coursera)
- [Course] IBM Data Science Professional Certificate by IBM Team (Coursera)
- [Course] IBM Machine Learning Professional Certificate by IBM Team (Coursera)
from IPython.display import display, HTML
display(HTML("<style>.rendered_html { font-size: 15px; font-family: 'Trebuchet MS'; }</style>"))