Model Deployment: Machine Learning Model Experiment Logging and Tracking Using Open-Source Frameworks


John Pauline Pineda

July 12, 2025


  • 1. Table of Contents
    • 1.1 Data Background
    • 1.2 Data Description
    • 1.3 Data Quality Assessment
    • 1.4 Data Preprocessing
      • 1.4.1 Data Splitting
      • 1.4.2 Data Profiling
      • 1.4.3 Category Aggregation and Encoding
      • 1.4.4 Outlier and Distributional Shape Analysis
      • 1.4.5 Collinearity
    • 1.5 Data Exploration
      • 1.5.1 Exploratory Data Analysis
      • 1.5.2 Hypothesis Testing
    • 1.6 Premodelling Data Preparation
      • 1.6.1 Preprocessed Data Description
      • 1.6.2 Preprocessing Pipeline Development
    • 1.7 Bagged Model Development, Logging and Tracking
      • 1.7.1 Random Forest
      • 1.7.2 Extra Trees
      • 1.7.3 Bagged Decision Trees
      • 1.7.4 Bagged Logistic Regression
      • 1.7.5 Bagged Support Vector Machine
    • 1.8 Boosted Model Development, Logging and Tracking
      • 1.8.1 AdaBoost
      • 1.8.2 Gradient Boosting
      • 1.8.3 XGBoost
      • 1.8.4 Light GBM
      • 1.8.5 CatBoost
    • 1.9 Artifact Storage
    • 1.10 Run Comparison
    • 1.11 Experiment Organization
    • 1.12 Consolidated Findings
  • 2. Summary
  • 3. References

1. Table of Contents

1.1. Data Background

An open Thyroid Disease Dataset from Kaggle (with all credits attributed to Jai Naru and Abuchi Onwuegbusi) was used for the analysis as consolidated from the following primary sources:

  1. Reference Repository entitled Differentiated Thyroid Cancer Recurrence from UC Irvine Machine Learning Repository
  2. Research Paper entitled Machine Learning for Risk Stratification of Thyroid Cancer Patients: a 15-year Cohort Study from the European Archives of Oto-Rhino-Laryngology

This study hypothesized that various clinicopathological characteristics influence the recurrence of differentiated thyroid cancer among patients.

The dichotomous categorical target variable for the study is:

  • Recurred - Status of the patient (Yes, Recurrence of differentiated thyroid cancer | No, No recurrence of differentiated thyroid cancer)

The predictor variables for the study are:

  • Age - Patient's age (Years)
  • Gender - Patient's sex (M | F)
  • Smoking - Indication of smoking (Yes | No)
  • Hx Smoking - Indication of smoking history (Yes | No)
  • Hx Radiotherapy - Indication of radiotherapy history for any condition (Yes | No)
  • Thyroid Function - Status of thyroid function (Clinical Hyperthyroidism, Hypothyroidism | Subclinical Hyperthyroidism, Hypothyroidism | Euthyroid)
  • Physical Examination - Findings from physical examination including palpation of the thyroid gland and surrounding structures (Normal | Diffuse Goiter | Multinodular Goiter | Single Nodular Goiter Left, Right)
  • Adenopathy - Indication of enlarged lymph nodes in the neck region (No | Right | Extensive | Left | Bilateral | Posterior)
  • Pathology - Specific thyroid cancer type as determined by pathology examination of biopsy samples (Follicular | Hurthle Cell | Micropapillary | Papillary)
  • Focality - Indication if the cancer is limited to one location or present in multiple locations (Uni-Focal | Multi-Focal)
  • Risk - Risk category of the cancer based on various factors, such as tumor size, extent of spread, and histological type (Low | Intermediate | High)
  • T - Tumor classification based on its size and extent of invasion into nearby structures (T1a | T1b | T2 | T3a | T3b | T4a | T4b)
  • N - Nodal classification indicating the involvement of lymph nodes (N0 | N1a | N1b)
  • M - Metastasis classification indicating the presence or absence of distant metastases (M0 | M1)
  • Stage - Overall stage of the cancer, typically determined by combining T, N, and M classifications (I | II | III | IVA | IVB)
  • Response - Cancer's response to treatment (Biochemical Incomplete | Indeterminate | Excellent | Structural Incomplete)

1.2. Data Description

  1. The initial tabular dataset comprised 383 observations and 17 variables (1 target and 16 predictors).
    • 383 rows (observations)
    • 17 columns (variables)
      • 1/17 target (categorical)
        • Recurred
      • 1/17 predictor (numeric)
        • Age
      • 15/17 predictor (categorical)
        • Gender
        • Smoking
        • Hx_Smoking
        • Hx_Radiotherapy
        • Thyroid_Function
        • Physical_Examination
        • Adenopathy
        • Pathology
        • Focality
        • Risk
        • T
        • N
        • M
        • Stage
        • Response
In [75]:
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
import itertools
import os
import pickle
%matplotlib inline

from operator import add,mul,truediv
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from scipy import stats
from scipy.stats import pointbiserialr, chi2_contingency

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold, KFold, cross_val_score
from sklearn.inspection import permutation_importance
In [76]:
##################################
# Defining file paths
##################################
DATASETS_ORIGINAL_PATH = r"datasets\original"
DATASETS_FINAL_PATH = r"datasets\final\complete"
DATASETS_FINAL_TRAIN_PATH = r"datasets\final\train"
DATASETS_FINAL_TRAIN_FEATURES_PATH = r"datasets\final\train\features"
DATASETS_FINAL_TRAIN_TARGET_PATH = r"datasets\final\train\target"
DATASETS_FINAL_VALIDATION_PATH = r"datasets\final\validation"
DATASETS_FINAL_VALIDATION_FEATURES_PATH = r"datasets\final\validation\features"
DATASETS_FINAL_VALIDATION_TARGET_PATH = r"datasets\final\validation\target"
DATASETS_FINAL_TEST_PATH = r"datasets\final\test"
DATASETS_FINAL_TEST_FEATURES_PATH = r"datasets\final\test\features"
DATASETS_FINAL_TEST_TARGET_PATH = r"datasets\final\test\target"
DATASETS_PREPROCESSED_PATH = r"datasets\preprocessed"
DATASETS_PREPROCESSED_TRAIN_PATH = r"datasets\preprocessed\train"
DATASETS_PREPROCESSED_TRAIN_FEATURES_PATH = r"datasets\preprocessed\train\features"
DATASETS_PREPROCESSED_TRAIN_TARGET_PATH = r"datasets\preprocessed\train\target"
DATASETS_PREPROCESSED_VALIDATION_PATH = r"datasets\preprocessed\validation"
DATASETS_PREPROCESSED_VALIDATION_FEATURES_PATH = r"datasets\preprocessed\validation\features"
DATASETS_PREPROCESSED_VALIDATION_TARGET_PATH = r"datasets\preprocessed\validation\target"
DATASETS_PREPROCESSED_TEST_PATH = r"datasets\preprocessed\test"
DATASETS_PREPROCESSED_TEST_FEATURES_PATH = r"datasets\preprocessed\test\features"
DATASETS_PREPROCESSED_TEST_TARGET_PATH = r"datasets\preprocessed\test\target"
MODELS_PATH = r"models"
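The raw-string constants above hard-code the Windows backslash separator, so they resolve as intended only on Windows. A minimal sketch of a portable alternative, assuming the same folder layout, builds the paths with `pathlib`. The `DATASETS_ROOT` and `thyroid_csv_path` names are introduced here for illustration and are not part of the original notebook.

```python
from pathlib import Path

# Hypothetical root folder name, matching the "datasets" prefix used above
DATASETS_ROOT = Path("datasets")

# Path objects are joined with "/" and rendered with the host OS separator
DATASETS_ORIGINAL_PATH = DATASETS_ROOT / "original"
DATASETS_FINAL_TRAIN_FEATURES_PATH = DATASETS_ROOT / "final" / "train" / "features"

# Same relative location that the dataset is read from (one level above the notebook)
thyroid_csv_path = Path("..") / DATASETS_ORIGINAL_PATH / "Thyroid_Diff.csv"

# as_posix() shows the path with forward slashes regardless of platform
print(thyroid_csv_path.as_posix())
```

`pd.read_csv` accepts a `Path` object directly, so the remaining constants could be converted the same way without changing the loading call.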
In [77]:
##################################
# Loading the dataset
# from the DATASETS_ORIGINAL_PATH
##################################
thyroid_cancer = pd.read_csv(os.path.join("..", DATASETS_ORIGINAL_PATH, "Thyroid_Diff.csv"))
In [78]:
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(thyroid_cancer.shape)
Dataset Dimensions: 
(383, 17)
In [79]:
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(thyroid_cancer.dtypes)
Column Names and Data Types:
Age                      int64
Gender                  object
Smoking                 object
Hx Smoking              object
Hx Radiotherapy         object
Thyroid Function        object
Physical Examination    object
Adenopathy              object
Pathology               object
Focality                object
Risk                    object
T                       object
N                       object
M                       object
Stage                   object
Response                object
Recurred                object
dtype: object
In [80]:
##################################
# Renaming and standardizing the column names
# to replace blanks with underscores
##################################
thyroid_cancer.columns = thyroid_cancer.columns.str.replace(" ", "_")
In [81]:
##################################
# Taking a snapshot of the dataset
##################################
thyroid_cancer.head()
Out[81]:
Age Gender Smoking Hx_Smoking Hx_Radiotherapy Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N M Stage Response Recurred
0 27 F No No No Euthyroid Single nodular goiter-left No Micropapillary Uni-Focal Low T1a N0 M0 I Indeterminate No
1 34 F No Yes No Euthyroid Multinodular goiter No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
2 30 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
3 62 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
4 62 F No No No Euthyroid Multinodular goiter No Micropapillary Multi-Focal Low T1a N0 M0 I Excellent No
In [82]:
##################################
# Selecting categorical columns (both object and categorical types)
# and listing the unique categorical levels
##################################
cat_cols = thyroid_cancer.select_dtypes(include=["object", "category"]).columns
for col in cat_cols:
    print(f"Categorical | Object Column: {col}")
    print(thyroid_cancer[col].unique())  
    print("-" * 40)
Categorical | Object Column: Gender
['F' 'M']
----------------------------------------
Categorical | Object Column: Smoking
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Hx_Smoking
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Hx_Radiotherapy
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Thyroid_Function
['Euthyroid' 'Clinical Hyperthyroidism' 'Clinical Hypothyroidism'
 'Subclinical Hyperthyroidism' 'Subclinical Hypothyroidism']
----------------------------------------
Categorical | Object Column: Physical_Examination
['Single nodular goiter-left' 'Multinodular goiter'
 'Single nodular goiter-right' 'Normal' 'Diffuse goiter']
----------------------------------------
Categorical | Object Column: Adenopathy
['No' 'Right' 'Extensive' 'Left' 'Bilateral' 'Posterior']
----------------------------------------
Categorical | Object Column: Pathology
['Micropapillary' 'Papillary' 'Follicular' 'Hurthel cell']
----------------------------------------
Categorical | Object Column: Focality
['Uni-Focal' 'Multi-Focal']
----------------------------------------
Categorical | Object Column: Risk
['Low' 'Intermediate' 'High']
----------------------------------------
Categorical | Object Column: T
['T1a' 'T1b' 'T2' 'T3a' 'T3b' 'T4a' 'T4b']
----------------------------------------
Categorical | Object Column: N
['N0' 'N1b' 'N1a']
----------------------------------------
Categorical | Object Column: M
['M0' 'M1']
----------------------------------------
Categorical | Object Column: Stage
['I' 'II' 'IVB' 'III' 'IVA']
----------------------------------------
Categorical | Object Column: Response
['Indeterminate' 'Excellent' 'Structural Incomplete'
 'Biochemical Incomplete']
----------------------------------------
Categorical | Object Column: Recurred
['No' 'Yes']
----------------------------------------
In [83]:
##################################
# Correcting a category level
##################################
thyroid_cancer["Pathology"] = thyroid_cancer["Pathology"].replace("Hurthel cell", "Hurthle Cell")
In [84]:
##################################
# Setting the levels of the categorical variables
##################################
thyroid_cancer['Recurred'] = thyroid_cancer['Recurred'].astype('category')
thyroid_cancer['Recurred'] = thyroid_cancer['Recurred'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Gender'] = thyroid_cancer['Gender'].astype('category')
thyroid_cancer['Gender'] = thyroid_cancer['Gender'].cat.set_categories(['M', 'F'], ordered=True)
thyroid_cancer['Smoking'] = thyroid_cancer['Smoking'].astype('category')
thyroid_cancer['Smoking'] = thyroid_cancer['Smoking'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Hx_Smoking'] = thyroid_cancer['Hx_Smoking'].astype('category')
thyroid_cancer['Hx_Smoking'] = thyroid_cancer['Hx_Smoking'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Hx_Radiotherapy'] = thyroid_cancer['Hx_Radiotherapy'].astype('category')
thyroid_cancer['Hx_Radiotherapy'] = thyroid_cancer['Hx_Radiotherapy'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Thyroid_Function'] = thyroid_cancer['Thyroid_Function'].astype('category')
thyroid_cancer['Thyroid_Function'] = thyroid_cancer['Thyroid_Function'].cat.set_categories(['Euthyroid', 'Subclinical Hypothyroidism', 'Subclinical Hyperthyroidism', 'Clinical Hypothyroidism', 'Clinical Hyperthyroidism'], ordered=True)
thyroid_cancer['Physical_Examination'] = thyroid_cancer['Physical_Examination'].astype('category')
thyroid_cancer['Physical_Examination'] = thyroid_cancer['Physical_Examination'].cat.set_categories(['Normal', 'Single nodular goiter-left', 'Single nodular goiter-right', 'Multinodular goiter', 'Diffuse goiter'], ordered=True)
thyroid_cancer['Adenopathy'] = thyroid_cancer['Adenopathy'].astype('category')
thyroid_cancer['Adenopathy'] = thyroid_cancer['Adenopathy'].cat.set_categories(['No', 'Left', 'Right', 'Bilateral', 'Posterior', 'Extensive'], ordered=True)
thyroid_cancer['Pathology'] = thyroid_cancer['Pathology'].astype('category')
thyroid_cancer['Pathology'] = thyroid_cancer['Pathology'].cat.set_categories(['Hurthle Cell', 'Follicular', 'Micropapillary', 'Papillary'], ordered=True)
thyroid_cancer['Focality'] = thyroid_cancer['Focality'].astype('category')
thyroid_cancer['Focality'] = thyroid_cancer['Focality'].cat.set_categories(['Uni-Focal', 'Multi-Focal'], ordered=True)
thyroid_cancer['Risk'] = thyroid_cancer['Risk'].astype('category')
thyroid_cancer['Risk'] = thyroid_cancer['Risk'].cat.set_categories(['Low', 'Intermediate', 'High'], ordered=True)
thyroid_cancer['T'] = thyroid_cancer['T'].astype('category')
thyroid_cancer['T'] = thyroid_cancer['T'].cat.set_categories(['T1a', 'T1b', 'T2', 'T3a', 'T3b', 'T4a', 'T4b'], ordered=True)
thyroid_cancer['N'] = thyroid_cancer['N'].astype('category')
thyroid_cancer['N'] = thyroid_cancer['N'].cat.set_categories(['N0', 'N1a', 'N1b'], ordered=True)
thyroid_cancer['M'] = thyroid_cancer['M'].astype('category')
thyroid_cancer['M'] = thyroid_cancer['M'].cat.set_categories(['M0', 'M1'], ordered=True)
thyroid_cancer['Stage'] = thyroid_cancer['Stage'].astype('category')
thyroid_cancer['Stage'] = thyroid_cancer['Stage'].cat.set_categories(['I', 'II', 'III', 'IVA', 'IVB'], ordered=True)
thyroid_cancer['Response'] = thyroid_cancer['Response'].astype('category')
thyroid_cancer['Response'] = thyroid_cancer['Response'].cat.set_categories(['Excellent', 'Structural Incomplete', 'Biochemical Incomplete', 'Indeterminate'], ordered=True)
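The paired `astype('category')` and `set_categories` calls above repeat the same pattern for every column. As a hedged sketch of a more compact form, the levels can be held in one dictionary and applied in a loop with `pd.CategoricalDtype`. The toy frame and the `ordered_levels` name are illustrative assumptions; only two of the notebook's columns are shown.

```python
import pandas as pd

# Toy frame standing in for thyroid_cancer (values are illustrative only)
toy = pd.DataFrame({
    "Risk": ["Low", "High", "Intermediate", "Low"],
    "N": ["N0", "N1b", "N1a", "N0"],
})

# One mapping from column name to its ordered category levels
ordered_levels = {
    "Risk": ["Low", "Intermediate", "High"],
    "N": ["N0", "N1a", "N1b"],
}

# Cast each column once with an explicit ordered categorical dtype
for col, levels in ordered_levels.items():
    toy[col] = toy[col].astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Ordered categoricals support comparisons and min/max in level order
print(toy["Risk"].cat.categories.tolist())
print(toy["Risk"].max())
```

Keeping the levels in a single mapping also makes it easier to reuse the same ordering later, for example when configuring an ordinal encoder.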
In [85]:
##################################
# Performing a general exploration of the numeric variables
##################################
print('Numeric Variable Summary:')
display(thyroid_cancer.describe(include='number').transpose())
Numeric Variable Summary:
count mean std min 25% 50% 75% max
Age 383.0 40.866841 15.134494 15.0 29.0 37.0 51.0 82.0
In [86]:
##################################
# Performing a general exploration of the categorical variables
##################################
print('Categorical Variable Summary:')
display(thyroid_cancer.describe(include='category').transpose())
Categorical Variable Summary:
count unique top freq
Gender 383 2 F 312
Smoking 383 2 No 334
Hx_Smoking 383 2 No 355
Hx_Radiotherapy 383 2 No 376
Thyroid_Function 383 5 Euthyroid 332
Physical_Examination 383 5 Single nodular goiter-right 140
Adenopathy 383 6 No 277
Pathology 383 4 Papillary 287
Focality 383 2 Uni-Focal 247
Risk 383 3 Low 249
T 383 7 T2 151
N 383 3 N0 268
M 383 2 M0 365
Stage 383 5 I 333
Response 383 4 Excellent 208
Recurred 383 2 No 275
In [87]:
##################################
# Performing a general exploration of the categorical variable levels
# based on the ordered categories
##################################
ordered_cat_cols = thyroid_cancer.select_dtypes(include=["category"]).columns
for col in ordered_cat_cols:
    print(f"Column: {col}")
    print("Absolute Frequencies:")
    print(thyroid_cancer[col].value_counts().reindex(thyroid_cancer[col].cat.categories))
    print("\nNormalized Frequencies:")
    print(thyroid_cancer[col].value_counts(normalize=True).reindex(thyroid_cancer[col].cat.categories))
    print("-" * 50)
Column: Gender
Absolute Frequencies:
M     71
F    312
Name: count, dtype: int64

Normalized Frequencies:
M    0.185379
F    0.814621
Name: proportion, dtype: float64
--------------------------------------------------
Column: Smoking
Absolute Frequencies:
No     334
Yes     49
Name: count, dtype: int64

Normalized Frequencies:
No     0.872063
Yes    0.127937
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Smoking
Absolute Frequencies:
No     355
Yes     28
Name: count, dtype: int64

Normalized Frequencies:
No     0.926893
Yes    0.073107
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Radiotherapy
Absolute Frequencies:
No     376
Yes      7
Name: count, dtype: int64

Normalized Frequencies:
No     0.981723
Yes    0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Thyroid_Function
Absolute Frequencies:
Euthyroid                      332
Subclinical Hypothyroidism      14
Subclinical Hyperthyroidism      5
Clinical Hypothyroidism         12
Clinical Hyperthyroidism        20
Name: count, dtype: int64

Normalized Frequencies:
Euthyroid                      0.866841
Subclinical Hypothyroidism     0.036554
Subclinical Hyperthyroidism    0.013055
Clinical Hypothyroidism        0.031332
Clinical Hyperthyroidism       0.052219
Name: proportion, dtype: float64
--------------------------------------------------
Column: Physical_Examination
Absolute Frequencies:
Normal                           7
Single nodular goiter-left      89
Single nodular goiter-right    140
Multinodular goiter            140
Diffuse goiter                   7
Name: count, dtype: int64

Normalized Frequencies:
Normal                         0.018277
Single nodular goiter-left     0.232376
Single nodular goiter-right    0.365535
Multinodular goiter            0.365535
Diffuse goiter                 0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Adenopathy
Absolute Frequencies:
No           277
Left          17
Right         48
Bilateral     32
Posterior      2
Extensive      7
Name: count, dtype: int64

Normalized Frequencies:
No           0.723238
Left         0.044386
Right        0.125326
Bilateral    0.083551
Posterior    0.005222
Extensive    0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Pathology
Absolute Frequencies:
Hurthle Cell       20
Follicular         28
Micropapillary     48
Papillary         287
Name: count, dtype: int64

Normalized Frequencies:
Hurthle Cell      0.052219
Follicular        0.073107
Micropapillary    0.125326
Papillary         0.749347
Name: proportion, dtype: float64
--------------------------------------------------
Column: Focality
Absolute Frequencies:
Uni-Focal      247
Multi-Focal    136
Name: count, dtype: int64

Normalized Frequencies:
Uni-Focal      0.644909
Multi-Focal    0.355091
Name: proportion, dtype: float64
--------------------------------------------------
Column: Risk
Absolute Frequencies:
Low             249
Intermediate    102
High             32
Name: count, dtype: int64

Normalized Frequencies:
Low             0.650131
Intermediate    0.266319
High            0.083551
Name: proportion, dtype: float64
--------------------------------------------------
Column: T
Absolute Frequencies:
T1a     49
T1b     43
T2     151
T3a     96
T3b     16
T4a     20
T4b      8
Name: count, dtype: int64

Normalized Frequencies:
T1a    0.127937
T1b    0.112272
T2     0.394256
T3a    0.250653
T3b    0.041775
T4a    0.052219
T4b    0.020888
Name: proportion, dtype: float64
--------------------------------------------------
Column: N
Absolute Frequencies:
N0     268
N1a     22
N1b     93
Name: count, dtype: int64

Normalized Frequencies:
N0     0.699739
N1a    0.057441
N1b    0.242820
Name: proportion, dtype: float64
--------------------------------------------------
Column: M
Absolute Frequencies:
M0    365
M1     18
Name: count, dtype: int64

Normalized Frequencies:
M0    0.953003
M1    0.046997
Name: proportion, dtype: float64
--------------------------------------------------
Column: Stage
Absolute Frequencies:
I      333
II      32
III      4
IVA      3
IVB     11
Name: count, dtype: int64

Normalized Frequencies:
I      0.869452
II     0.083551
III    0.010444
IVA    0.007833
IVB    0.028721
Name: proportion, dtype: float64
--------------------------------------------------
Column: Response
Absolute Frequencies:
Excellent                 208
Structural Incomplete      91
Biochemical Incomplete     23
Indeterminate              61
Name: count, dtype: int64

Normalized Frequencies:
Excellent                 0.543081
Structural Incomplete     0.237598
Biochemical Incomplete    0.060052
Indeterminate             0.159269
Name: proportion, dtype: float64
--------------------------------------------------
Column: Recurred
Absolute Frequencies:
No     275
Yes    108
Name: count, dtype: int64

Normalized Frequencies:
No     0.718016
Yes    0.281984
Name: proportion, dtype: float64
--------------------------------------------------

1.3. Data Quality Assessment

Data quality findings based on assessment are as follows:

  1. A total of 19 duplicated rows were identified.
    • In total, 35 observations were affected, consisting of 16 unique occurrences and 19 subsequent duplicates.
    • These 19 duplicates spanned 16 distinct variations, meaning some variations had multiple duplicates.
    • To clean the dataset, all 19 duplicate rows were removed, retaining only the first occurrence of each of the 16 unique variations.
  2. No missing data were noted: no variable had Null.Count>0 or Fill.Rate<1.0.
  3. Low variance observed for 8 variables with First.Second.Mode.Ratio>5.
    • Hx_Radiotherapy: First.Second.Mode.Ratio = 51.000 (comprised 2 category levels)
    • M: First.Second.Mode.Ratio = 19.222 (comprised 2 category levels)
    • Thyroid_Function: First.Second.Mode.Ratio = 15.650 (comprised 5 category levels)
    • Hx_Smoking: First.Second.Mode.Ratio = 12.000 (comprised 2 category levels)
    • Stage: First.Second.Mode.Ratio = 9.812 (comprised 5 category levels)
    • Smoking: First.Second.Mode.Ratio = 6.428 (comprised 2 category levels)
    • Pathology: First.Second.Mode.Ratio = 6.022 (comprised 4 category levels)
    • Adenopathy: First.Second.Mode.Ratio = 5.375 (comprised 6 category levels)
  4. No low variance observed for any variable with Unique.Count.Ratio>10.
  5. No high skewness observed for any variable with Skewness>3 or Skewness<(-3).
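The low-variance figures in finding 3 can be reproduced by hand. The sketch below assumes that First.Second.Mode.Ratio is the count of the most frequent level divided by the count of the second most frequent level; this definition is inferred from the reported values rather than stated above, and the helper name `first_second_mode_ratio` is hypothetical. For Hx_Radiotherapy, the 376 "No" rows drop to 357 once the 19 duplicates (all "No") are removed, while the 7 "Yes" rows are unchanged.

```python
import pandas as pd

def first_second_mode_ratio(values: pd.Series) -> float:
    """Ratio of the most frequent level count to the second most frequent.

    Assumed definition; raises IndexError if fewer than two levels occur.
    """
    counts = values.value_counts()  # sorted by descending frequency
    return counts.iloc[0] / counts.iloc[1]

# Hx_Radiotherapy counts after duplicate removal: 357 "No" and 7 "Yes"
hx_radiotherapy = pd.Series(["No"] * 357 + ["Yes"] * 7)

ratio = first_second_mode_ratio(hx_radiotherapy)
print(round(ratio, 3))
```

Under this assumed definition, a ratio above the threshold of 5 flags a variable whose dominant level heavily outweighs the next one.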
In [88]:
##################################
# Counting the number of duplicated rows
##################################
thyroid_cancer.duplicated().sum()
Out[88]:
np.int64(19)
In [89]:
##################################
# Exploring the duplicated rows
##################################
duplicated_rows = thyroid_cancer[thyroid_cancer.duplicated(keep=False)]
display(duplicated_rows)
Age Gender Smoking Hx_Smoking Hx_Radiotherapy Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N M Stage Response Recurred
8 51 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
9 40 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
22 36 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
32 36 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
38 40 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
40 51 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
61 35 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
66 35 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
67 51 F No No No Euthyroid Single nodular goiter-left No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
69 51 F No No No Euthyroid Single nodular goiter-left No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
73 29 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
77 29 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
106 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
110 31 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
113 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
115 37 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
119 28 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
120 37 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
121 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
123 28 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
132 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
136 21 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
137 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
138 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
142 42 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
161 22 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
166 31 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
168 21 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
170 38 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
175 34 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
178 38 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
183 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
187 34 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
189 42 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
196 22 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
In [90]:
##################################
# Checking if duplicated rows have identical values across all columns
##################################
num_unique_dup_rows = duplicated_rows.drop_duplicates().shape[0]
num_total_dup_rows = duplicated_rows.shape[0]
if num_unique_dup_rows == 1:
    print("All duplicated rows have the same values across all columns.")
else:
    print(f"There are {num_unique_dup_rows} unique versions among the {num_total_dup_rows} duplicated rows.")
There are 16 unique versions among the 35 duplicated rows.
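The three counts reported so far (19, 35 and 16) come from different views of the same duplicate groups. A small sketch on toy data, not the thyroid dataset, shows how `duplicated()`, `duplicated(keep=False)` and `drop_duplicates()` relate; all variable names are illustrative.

```python
import pandas as pd

# Toy data: one group of three identical rows, one pair, and one unique row
toy = pd.DataFrame({
    "Age": [26, 26, 26, 32, 32, 40],
    "T": ["T2", "T2", "T2", "T2", "T2", "T1a"],
})

# Repeats that follow a first occurrence (what drop_duplicates would remove)
n_extra = int(toy.duplicated().sum())

# Every row that belongs to a duplicated group, first occurrences included
flagged = toy[toy.duplicated(keep=False)]
n_flagged = len(flagged)

# Number of distinct row variations among the flagged rows
n_variations = len(flagged.drop_duplicates())

print(n_extra, n_flagged, n_variations)
```

The flagged total always equals the number of variations plus the number of extra repeats, which is the same accounting as 35 = 16 + 19 in the notebook output.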
In [91]:
##################################
# Counting the unique variations among duplicated rows
##################################
unique_dup_variations = duplicated_rows.drop_duplicates()
variation_counts = duplicated_rows.value_counts().reset_index(name="Count")
print("Unique duplicated row variations and their counts:")
display(variation_counts)
Unique duplicated row variations and their counts:
Age Gender Smoking Hx_Smoking Hx_Radiotherapy Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N M Stage Response Recurred Count
0 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 4
1 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 3
2 22 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
3 21 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
4 28 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
5 29 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No 2
6 31 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
7 34 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
8 35 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No 2
9 36 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No 2
10 37 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
11 38 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
12 40 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No 2
13 42 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
14 51 F No No No Euthyroid Single nodular goiter-left No Papillary Uni-Focal Low T1b N0 M0 I Excellent No 2
15 51 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No 2
In [92]:
##################################
# Removing the duplicated rows and
# retaining only the first occurrence
##################################
thyroid_cancer_row_filtered = thyroid_cancer.drop_duplicates(keep="first")
print('Dataset Dimensions: ')
display(thyroid_cancer_row_filtered.shape)
Dataset Dimensions: 
(364, 17)
In [93]:
##################################
# Gathering the data types for each column
##################################
data_type_list = list(thyroid_cancer_row_filtered.dtypes)
In [94]:
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(thyroid_cancer_row_filtered.columns)
In [95]:
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(thyroid_cancer_row_filtered)] * len(thyroid_cancer_row_filtered.columns))
In [96]:
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(thyroid_cancer_row_filtered.isna().sum(axis=0))
In [97]:
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(thyroid_cancer_row_filtered.count())
In [98]:
##################################
# Gathering the fill rate (proportion of non-missing data) for each column
##################################
fill_rate_list = map(truediv, non_null_count_list, row_count_list)
In [99]:
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
                                              data_type_list,
                                              row_count_list,
                                              non_null_count_list,
                                              null_count_list,
                                              fill_rate_list), 
                                        columns=['Column.Name',
                                                 'Column.Type',
                                                 'Row.Count',
                                                 'Non.Null.Count',
                                                 'Null.Count',                                                 
                                                 'Fill.Rate'])
display(all_column_quality_summary)
Column.Name Column.Type Row.Count Non.Null.Count Null.Count Fill.Rate
0 Age int64 364 364 0 1.0
1 Gender category 364 364 0 1.0
2 Smoking category 364 364 0 1.0
3 Hx_Smoking category 364 364 0 1.0
4 Hx_Radiotherapy category 364 364 0 1.0
5 Thyroid_Function category 364 364 0 1.0
6 Physical_Examination category 364 364 0 1.0
7 Adenopathy category 364 364 0 1.0
8 Pathology category 364 364 0 1.0
9 Focality category 364 364 0 1.0
10 Risk category 364 364 0 1.0
11 T category 364 364 0 1.0
12 N category 364 364 0 1.0
13 M category 364 364 0 1.0
14 Stage category 364 364 0 1.0
15 Response category 364 364 0 1.0
16 Recurred category 364 364 0 1.0
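The column-level summary above can also be sketched in vectorized form. The example below is an illustrative alternative on hypothetical toy data, not the notebook's own implementation; it computes the fill rate by dividing two pandas objects directly, which avoids holding a one-shot `map` iterator that can only be consumed once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({
    "Age": [28.0, np.nan, 31.0, 34.0],
    "Gender": ["F", "F", "M", "F"],
})

column_quality = pd.DataFrame({
    "Column.Type": toy.dtypes,
    "Row.Count": len(toy),
    "Non.Null.Count": toy.count(),
    "Null.Count": toy.isna().sum(),
})

# Fill rate = share of non-missing values per column
column_quality["Fill.Rate"] = (
    column_quality["Non.Null.Count"] / column_quality["Row.Count"]
)
column_quality = column_quality.rename_axis("Column.Name").reset_index()

print(column_quality)
```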
In [100]:
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])
Out[100]:
0
In [101]:
##################################
# Identifying the columns
# with Fill.Rate < 0.90
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<0.90)]
In [102]:
##################################
# Gathering the indices for each observation
##################################
row_index_list = thyroid_cancer_row_filtered.index
In [103]:
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(thyroid_cancer_row_filtered.columns)] * len(thyroid_cancer_row_filtered))
In [104]:
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(thyroid_cancer_row_filtered.isna().sum(axis=1))
In [105]:
##################################
# Gathering the missing data rate for each row
##################################
missing_rate_list = map(truediv, null_row_list, column_count_list)
In [106]:
##################################
# Identifying the rows
# with missing data
##################################
all_row_quality_summary = pd.DataFrame(zip(row_index_list,
                                           column_count_list,
                                           null_row_list,
                                           missing_rate_list), 
                                        columns=['Row.Name',
                                                 'Column.Count',
                                                 'Null.Count',                                                 
                                                 'Missing.Rate'])
display(all_row_quality_summary)
Row.Name Column.Count Null.Count Missing.Rate
0 0 17 0 0.0
1 1 17 0 0.0
2 2 17 0 0.0
3 3 17 0 0.0
4 4 17 0 0.0
... ... ... ... ...
359 378 17 0 0.0
360 379 17 0 0.0
361 380 17 0 0.0
362 381 17 0 0.0
363 382 17 0 0.0

364 rows × 4 columns

In [107]:
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])
Out[107]:
0
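The row-level check can be condensed as well. The following is a minimal sketch on hypothetical toy data: averaging the boolean missing flags across each row gives the same missing rate as dividing the null count by the column count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two rows
toy = pd.DataFrame({
    "Age": [28.0, np.nan, 31.0],
    "Gender": ["F", None, "M"],
    "Recurred": ["No", "No", None],
})

# isna().mean(axis=1) averages the True/False missing flags across each row
row_missing_rate = toy.isna().mean(axis=1)

# Number of rows with at least one missing value
rows_with_missing = int((row_missing_rate > 0).sum())

print(row_missing_rate.round(4).tolist())
print(rows_with_missing)
```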
In [108]:
##################################
# Formulating the dataset
# with numeric columns only
##################################
thyroid_cancer_numeric = thyroid_cancer_row_filtered.select_dtypes(include='number')
In [109]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = thyroid_cancer_numeric.columns
In [110]:
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = thyroid_cancer_numeric.min()
In [111]:
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = thyroid_cancer_numeric.mean()
In [112]:
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = thyroid_cancer_numeric.median()
In [113]:
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = thyroid_cancer_numeric.max()
In [114]:
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0] for x in thyroid_cancer_numeric]
In [115]:
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1] for x in thyroid_cancer_numeric]
In [116]:
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [thyroid_cancer_numeric[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in thyroid_cancer_numeric]
In [117]:
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [thyroid_cancer_numeric[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in thyroid_cancer_numeric]
In [118]:
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
In [119]:
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = thyroid_cancer_numeric.nunique(dropna=True)
In [120]:
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(thyroid_cancer_numeric)] * len(thyroid_cancer_numeric.columns))
In [121]:
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
In [122]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = thyroid_cancer_numeric.skew()
In [123]:
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = thyroid_cancer_numeric.kurtosis()
In [124]:
##################################
# Generating a column quality summary for the numeric column
##################################
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
                                                numeric_minimum_list,
                                                numeric_mean_list,
                                                numeric_median_list,
                                                numeric_maximum_list,
                                                numeric_first_mode_list,
                                                numeric_second_mode_list,
                                                numeric_first_mode_count_list,
                                                numeric_second_mode_count_list,
                                                numeric_first_second_mode_ratio_list,
                                                numeric_unique_count_list,
                                                numeric_row_count_list,
                                                numeric_unique_count_ratio_list,
                                                numeric_skewness_list,
                                                numeric_kurtosis_list), 
                                        columns=['Numeric.Column.Name',
                                                 'Minimum',
                                                 'Mean',
                                                 'Median',
                                                 'Maximum',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio',
                                                 'Skewness',
                                                 'Kurtosis'])
display(numeric_column_quality_summary)
Numeric.Column.Name Minimum Mean Median Maximum First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio Skewness Kurtosis
0 Age 15 41.25 38.0 82 31 27 21 13 1.615385 65 364 0.178571 0.678269 -0.359255
In [125]:
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[125]:
0
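The first-to-second mode ratio used above can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `first_second_mode_ratio` is hypothetical, and returning infinity for a single-valued column is an illustrative choice. The toy series reuses the two mode counts reported for Age (21 and 13); the third value is made up.

```python
import pandas as pd


def first_second_mode_ratio(values: pd.Series) -> float:
    """Count of the most frequent value divided by the count of the runner-up."""
    counts = values.value_counts(dropna=True)
    if len(counts) < 2:
        # Assumption: a column with one distinct value is treated as fully dominated
        return float("inf")
    return counts.iloc[0] / counts.iloc[1]


# Toy series: 21 and 13 match the Age mode counts; the value 38 is filler
toy_age = pd.Series([31] * 21 + [27] * 13 + [38] * 5)
ratio = first_second_mode_ratio(toy_age)

print(round(ratio, 6))
```

The guard matters because indexing position 1 of `value_counts()` raises an `IndexError` when a column holds only one distinct value.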
In [126]:
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
Out[126]:
0
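Because Unique.Count.Ratio divides the number of distinct values by the number of rows, it cannot exceed 1, so a `> 10` filter returns no columns regardless of the data. The sketch below, on hypothetical toy data, illustrates the bound and how a fractional cutoff would behave. The 0.50 cutoff is an assumption for illustration only and is not a value taken from this analysis.

```python
import pandas as pd

# Hypothetical toy data: Age is all-distinct, Gender has two levels
toy = pd.DataFrame({
    "Age": [28, 29, 31, 34, 35, 36, 37, 38],
    "Gender": ["F", "F", "M", "F", "F", "M", "F", "F"],
})

# Unique.Count.Ratio = distinct values / rows, so it always lies in (0, 1]
unique_count_ratio = toy.nunique(dropna=True) / len(toy)

# Assumed fractional cutoff for illustration; the notebook itself uses > 10
ASSUMED_CUTOFF = 0.50
flagged = unique_count_ratio[unique_count_ratio > ASSUMED_CUTOFF].index.tolist()

print(unique_count_ratio.to_dict())
print(flagged)
```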
In [127]:
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])
Out[127]:
0
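For reading the Skewness and Kurtosis columns, it helps to recall that pandas reports bias-adjusted sample estimates and uses Fisher's definition of kurtosis, under which a normal distribution scores 0. The minimal sketch below uses hypothetical toy series to show the sign conventions.

```python
import pandas as pd

# Hypothetical toy series
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 10])

# A symmetric sample has zero skewness
symmetric_skew = symmetric.skew()

# A long right tail gives positive skewness
right_skew = right_skewed.skew()

# Excess kurtosis below zero indicates lighter tails than a normal distribution
symmetric_kurtosis = symmetric.kurtosis()

print(symmetric_skew, right_skew > 0, symmetric_kurtosis)
```

Under this convention, the Age values reported above (skewness near 0.68, kurtosis near -0.36) describe a mildly right-skewed distribution with slightly lighter tails than normal.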
In [128]:
##################################
# Formulating the dataset
# with categorical columns only
##################################
thyroid_cancer_categorical = thyroid_cancer_row_filtered.select_dtypes(include='category')
In [129]:
##################################
# Gathering the variable names for each categorical column
##################################
categorical_variable_name_list = thyroid_cancer_categorical.columns
In [130]:
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [thyroid_cancer_row_filtered[x].value_counts().index.tolist()[0] for x in thyroid_cancer_categorical]
In [131]:
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [thyroid_cancer_row_filtered[x].value_counts().index.tolist()[1] for x in thyroid_cancer_categorical]
In [132]:
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [thyroid_cancer_categorical[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in thyroid_cancer_categorical]
In [133]:
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [thyroid_cancer_categorical[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in thyroid_cancer_categorical]
In [134]:
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
In [135]:
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = thyroid_cancer_categorical.nunique(dropna=True)
In [136]:
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(thyroid_cancer_categorical)] * len(thyroid_cancer_categorical.columns))
In [137]:
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
In [138]:
##################################
# Generating a column quality summary for the categorical columns
##################################
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
                                                    categorical_first_mode_list,
                                                    categorical_second_mode_list,
                                                    categorical_first_mode_count_list,
                                                    categorical_second_mode_count_list,
                                                    categorical_first_second_mode_ratio_list,
                                                    categorical_unique_count_list,
                                                    categorical_row_count_list,
                                                    categorical_unique_count_ratio_list), 
                                        columns=['Categorical.Column.Name',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio'])
display(categorical_column_quality_summary)
    Categorical.Column.Name  First.Mode           Second.Mode                  First.Mode.Count  Second.Mode.Count  First.Second.Mode.Ratio  Unique.Count  Row.Count  Unique.Count.Ratio
0   Gender                   F                    M                            293               71                 4.126761                 2             364        0.005495
1   Smoking                  No                   Yes                          315               49                 6.428571                 2             364        0.005495
2   Hx_Smoking               No                   Yes                          336               28                 12.000000                2             364        0.005495
3   Hx_Radiotherapy          No                   Yes                          357               7                  51.000000                2             364        0.005495
4   Thyroid_Function         Euthyroid            Clinical Hyperthyroidism     313               20                 15.650000                5             364        0.013736
5   Physical_Examination     Multinodular goiter  Single nodular goiter-right  135               127                1.062992                 5             364        0.013736
6   Adenopathy               No                   Right                        258               48                 5.375000                 6             364        0.016484
7   Pathology                Papillary            Micropapillary               271               45                 6.022222                 4             364        0.010989
8   Focality                 Uni-Focal            Multi-Focal                  228               136                1.676471                 2             364        0.005495
9   Risk                     Low                  Intermediate                 230               102                2.254902                 3             364        0.008242
10  T                        T2                   T3a                          138               96                 1.437500                 7             364        0.019231
11  N                        N0                   N1b                          249               93                 2.677419                 3             364        0.008242
12  M                        M0                   M1                           346               18                 19.222222                2             364        0.005495
13  Stage                    I                    II                           314               32                 9.812500                 5             364        0.013736
14  Response                 Excellent            Structural Incomplete        189               91                 2.076923                 4             364        0.010989
15  Recurred                 No                   Yes                          256               108                2.370370                 2             364        0.005495
In [139]:
##################################
# Counting the number of categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[139]:
8
In [140]:
##################################
# Identifying the categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
display(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)].sort_values(by=['First.Second.Mode.Ratio'], ascending=False))
    Categorical.Column.Name  First.Mode           Second.Mode                  First.Mode.Count  Second.Mode.Count  First.Second.Mode.Ratio  Unique.Count  Row.Count  Unique.Count.Ratio
3   Hx_Radiotherapy          No                   Yes                          357               7                  51.000000                2             364        0.005495
12  M                        M0                   M1                           346               18                 19.222222                2             364        0.005495
4   Thyroid_Function         Euthyroid            Clinical Hyperthyroidism     313               20                 15.650000                5             364        0.013736
2   Hx_Smoking               No                   Yes                          336               28                 12.000000                2             364        0.005495
13  Stage                    I                    II                           314               32                 9.812500                 5             364        0.013736
1   Smoking                  No                   Yes                          315               49                 6.428571                 2             364        0.005495
7   Pathology                Papillary            Micropapillary               271               45                 6.022222                 4             364        0.010989
6   Adenopathy               No                   Right                        258               48                 5.375000                 6             364        0.016484
In [141]:
##################################
# Counting the number of categorical columns
# with Unique.Count.Ratio > 10.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
Out[141]:
0

1.4. Data Preprocessing ¶

1.4.1 Data Splitting¶

  1. The baseline dataset (with duplicate rows removed from the original dataset) is comprised of:
    • 364 rows (observations)
      • 256 Recurred=No: 70.33%
      • 108 Recurred=Yes: 29.67%
    • 17 columns (variables)
      • 1/17 target (categorical)
        • Recurred
      • 1/17 predictor (numeric)
        • Age
      • 15/17 predictor (categorical)
        • Gender
        • Smoking
        • Hx_Smoking
        • Hx_Radiotherapy
        • Thyroid_Function
        • Physical_Examination
        • Adenopathy
        • Pathology
        • Focality
        • Risk
        • T
        • N
        • M
        • Stage
        • Response
  2. The baseline dataset was divided into three subsets using a fixed random seed:
    • test data: 25% of the baseline data with class stratification applied
    • train data (initial): 75% of the baseline data with class stratification applied
      • train data (final): 75% of the train (initial) data with class stratification applied
      • validation data: 25% of the train (initial) data with class stratification applied
  3. Models were developed from the train data (final). Using the same dataset, a subset of models with optimal hyperparameters was selected based on cross-validation.
  4. Among candidate models with optimal hyperparameters, the final model was selected based on performance on the validation data.
  5. Performance of the selected final model (and other candidate models for post-model selection comparison) was evaluated using the test data.
  6. The train data (final) subset is comprised of:
    • 204 rows (observations)
      • 143 Recurred=No: 70.10%
      • 61 Recurred=Yes: 29.90%
    • 17 columns (variables)
  7. The validation data subset is comprised of:
    • 69 rows (observations)
      • 49 Recurred=No: 71.01%
      • 20 Recurred=Yes: 28.99%
    • 17 columns (variables)
  8. The test data subset is comprised of:
    • 91 rows (observations)
      • 64 Recurred=No: 70.33%
      • 27 Recurred=Yes: 29.67%
    • 17 columns (variables)
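The nested split described in the list above can be sketched as follows. This is a minimal illustration on a hypothetical stand-in frame that only reproduces the baseline class counts (256 No, 108 Yes); the seed value is an assumption and differs from the one used in the notebook, so the selected rows differ while the subset sizes and class counts do not.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same class counts as the baseline dataset
toy = pd.DataFrame({
    "Age": range(364),
    "Recurred": ["No"] * 256 + ["Yes"] * 108,
})

SEED = 42  # assumed seed for illustration; any fixed value gives reproducible splits

# First split: 75% initial train, 25% test, stratified on the target
train_initial, test = train_test_split(
    toy, test_size=0.25, stratify=toy["Recurred"], random_state=SEED
)

# Second split: 75% final train, 25% validation, stratified on the target
train, validation = train_test_split(
    train_initial,
    test_size=0.25,
    stratify=train_initial["Recurred"],
    random_state=SEED,
)

for name, part in [("train", train), ("validation", validation), ("test", test)]:
    share = part["Recurred"].value_counts(normalize=True).mul(100).round(2).to_dict()
    print(name, len(part), share)
```

With a fractional `test_size`, scikit-learn rounds the held-out count up, so 25% of 364 gives 91 test rows and 25% of 273 gives 69 validation rows, leaving 204 rows for the final train subset.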
In [142]:
##################################
# Creating a dataset copy
# of the row filtered data
##################################
thyroid_cancer_baseline = thyroid_cancer_row_filtered.copy()
In [143]:
##################################
# Performing a general exploration
# of the baseline dataset
##################################
print('Final Dataset Dimensions: ')
display(thyroid_cancer_baseline.shape)
Final Dataset Dimensions: 
(364, 17)
In [144]:
##################################
# Obtaining the distribution
# of the target variable
##################################
print('Target Variable Breakdown: ')
thyroid_cancer_breakdown = thyroid_cancer_baseline.groupby('Recurred', observed=True).size().reset_index(name='Count')
thyroid_cancer_breakdown['Percentage'] = (thyroid_cancer_breakdown['Count'] / len(thyroid_cancer_baseline)) * 100
display(thyroid_cancer_breakdown)
Target Variable Breakdown: 
Recurred Count Percentage
0 No 256 70.32967
1 Yes 108 29.67033
In [145]:
##################################
# Formulating the train and test data
# from the baseline dataset
# by applying stratification and
# using a 75-25 ratio
##################################
thyroid_cancer_train_initial, thyroid_cancer_test = train_test_split(thyroid_cancer_baseline, 
                                                               test_size=0.25, 
                                                               stratify=thyroid_cancer_baseline['Recurred'], 
                                                               random_state=987654321)
In [146]:
##################################
# Performing a general exploration
# of the initial training dataset
##################################
X_train_initial = thyroid_cancer_train_initial.drop('Recurred', axis = 1)
y_train_initial = thyroid_cancer_train_initial['Recurred']
print('Initial Train Dataset Dimensions: ')
display(X_train_initial.shape)
display(y_train_initial.shape)
print('Initial Train Target Variable Breakdown: ')
display(y_train_initial.value_counts())
print('Initial Train Target Variable Proportion: ')
display(y_train_initial.value_counts(normalize = True))
Initial Train Dataset Dimensions: 
(273, 16)
(273,)
Initial Train Target Variable Breakdown: 
Recurred
No     192
Yes     81
Name: count, dtype: int64
Initial Train Target Variable Proportion: 
Recurred
No     0.703297
Yes    0.296703
Name: proportion, dtype: float64
In [147]:
##################################
# Performing a general exploration
# of the test dataset
##################################
X_test = thyroid_cancer_test.drop('Recurred', axis = 1)
y_test = thyroid_cancer_test['Recurred']
print('Test Dataset Dimensions: ')
display(X_test.shape)
display(y_test.shape)
print('Test Target Variable Breakdown: ')
display(y_test.value_counts())
print('Test Target Variable Proportion: ')
display(y_test.value_counts(normalize = True))
Test Dataset Dimensions: 
(91, 16)
(91,)
Test Target Variable Breakdown: 
Recurred
No     64
Yes    27
Name: count, dtype: int64
Test Target Variable Proportion: 
Recurred
No     0.703297
Yes    0.296703
Name: proportion, dtype: float64
In [148]:
##################################
# Formulating the train and validation data
# from the initial train dataset
# by applying stratification and
# using a 75-25 ratio
##################################
thyroid_cancer_train, thyroid_cancer_validation = train_test_split(thyroid_cancer_train_initial, 
                                                             test_size=0.25, 
                                                             stratify=thyroid_cancer_train_initial['Recurred'], 
                                                             random_state=987654321)
In [149]:
##################################
# Performing a general exploration
# of the final training dataset
##################################
X_train = thyroid_cancer_train.drop('Recurred', axis = 1)
y_train = thyroid_cancer_train['Recurred']
print('Final Train Dataset Dimensions: ')
display(X_train.shape)
display(y_train.shape)
print('Final Train Target Variable Breakdown: ')
display(y_train.value_counts())
print('Final Train Target Variable Proportion: ')
display(y_train.value_counts(normalize = True))
Final Train Dataset Dimensions: 
(204, 16)
(204,)
Final Train Target Variable Breakdown: 
Recurred
No     143
Yes     61
Name: count, dtype: int64
Final Train Target Variable Proportion: 
Recurred
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
In [150]:
##################################
# Performing a general exploration
# of the validation dataset
##################################
X_validation = thyroid_cancer_validation.drop('Recurred', axis = 1)
y_validation = thyroid_cancer_validation['Recurred']
print('Validation Dataset Dimensions: ')
display(X_validation.shape)
display(y_validation.shape)
print('Validation Target Variable Breakdown: ')
display(y_validation.value_counts())
print('Validation Target Variable Proportion: ')
display(y_validation.value_counts(normalize = True))
Validation Dataset Dimensions: 
(69, 16)
(69,)
Validation Target Variable Breakdown: 
Recurred
No     49
Yes    20
Name: count, dtype: int64
Validation Target Variable Proportion: 
Recurred
No     0.710145
Yes    0.289855
Name: proportion, dtype: float64
In [151]:
##################################
# Saving the training data
# to the DATASETS_FINAL_TRAIN_PATH
# and DATASETS_FINAL_TRAIN_FEATURES_PATH
# and DATASETS_FINAL_TRAIN_TARGET_PATH
##################################
thyroid_cancer_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_PATH, "thyroid_cancer_train.csv"), index=False)
X_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_FEATURES_PATH, "X_train.csv"), index=False)
y_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_TARGET_PATH, "y_train.csv"), index=False)
In [152]:
##################################
# Saving the validation data
# to the DATASETS_FINAL_VALIDATION_PATH
# and DATASETS_FINAL_VALIDATION_FEATURES_PATH
# and DATASETS_FINAL_VALIDATION_TARGET_PATH
##################################
thyroid_cancer_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_PATH, "thyroid_cancer_validation.csv"), index=False)
X_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_FEATURES_PATH, "X_validation.csv"), index=False)
y_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_TARGET_PATH, "y_validation.csv"), index=False)
In [153]:
##################################
# Saving the test data
# to the DATASETS_FINAL_TEST_PATH
# and DATASETS_FINAL_TEST_FEATURES_PATH
# and DATASETS_FINAL_TEST_TARGET_PATH
##################################
thyroid_cancer_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_PATH, "thyroid_cancer_test.csv"), index=False)
X_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_FEATURES_PATH, "X_test.csv"), index=False)
y_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_TARGET_PATH, "y_test.csv"), index=False)
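The save steps above assume that the target folders already exist. The sketch below is a hedged illustration using a temporary directory and a hypothetical folder layout (`ASSUMED_TRAIN_PATH` stands in for the notebook's `DATASETS_FINAL_TRAIN_PATH`, which is defined elsewhere). It creates the folder before writing and reads the file back to check the round trip.

```python
import os
import tempfile

import pandas as pd

# Hypothetical folder layout inside a temporary directory
base_dir = tempfile.mkdtemp()
ASSUMED_TRAIN_PATH = os.path.join(base_dir, "datasets", "final", "train")

# Hypothetical toy data with a categorical target, as in the notebook
toy_train = pd.DataFrame({"Age": [28, 29], "Recurred": ["No", "Yes"]})
toy_train["Recurred"] = toy_train["Recurred"].astype("category")

# Create the target folder first so to_csv does not fail on a missing directory
os.makedirs(ASSUMED_TRAIN_PATH, exist_ok=True)
out_file = os.path.join(ASSUMED_TRAIN_PATH, "thyroid_cancer_train.csv")
toy_train.to_csv(out_file, index=False)

# Reading the file back recovers the values but not the category dtype
reloaded = pd.read_csv(out_file)
print(reloaded.shape)
print(reloaded.dtypes.to_dict())
```

Since CSV files carry no dtype information, categorical columns need to be cast again with `astype("category")` after reloading if later steps depend on that dtype.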

1.4.2 Data Profiling¶

1.4.3 Category Aggregation and Encoding¶

1.4.4 Outlier and Distributional Shape Analysis¶

1.4.5 Collinearity¶

1.5. Data Exploration ¶

1.5.1 Exploratory Data Analysis¶

1.5.2 Hypothesis Testing¶

1.6. Premodelling Data Preparation ¶

1.6.1 Preprocessed Data Description¶

1.6.2 Preprocessing Pipeline Development¶

1.7. Bagged Model Development, Logging and Tracking ¶

1.7.1 Random Forest¶

1.7.2 Extra Trees¶

1.7.3 Bagged Decision Trees¶

1.7.4 Bagged Logistic Regression¶

1.7.5 Bagged Support Vector Machine¶

1.8. Boosted Model Development, Logging and Tracking ¶

1.8.1 AdaBoost¶

1.8.2 Gradient Boosting¶

1.8.3 XGBoost¶

1.8.4 Light GBM¶

1.8.5 CatBoost¶

1.9. Artifact Storage ¶

1.10. Run Comparison ¶

1.11. Experiment Organization ¶

1.12. Consolidated Findings ¶

2. Summary ¶

3. References ¶