Model Deployment: Machine Learning Model Experiment Logging and Tracking Using Open-Source Frameworks
- 1. Table of Contents
- 1.1 Data Background
- 1.2 Data Description
- 1.3 Data Quality Assessment
- 1.4 Data Preprocessing
- 1.5 Data Exploration
- 1.6 Premodelling Data Preparation
- 1.7 Bagged Model Development, Logging and Tracking
- 1.8 Boosted Model Development, Logging and Tracking
- 1.9 Artifact Storage
- 1.10 Run Comparison
- 1.11 Experiment Organization
- 1.12 Consolidated Findings
- 2. Summary
- 3. References
1. Table of Contents
1.1. Data Background
An open Thyroid Disease Dataset from Kaggle (with all credits attributed to Jai Naru and Abuchi Onwuegbusi) was used for the analysis as consolidated from the following primary sources:
- Reference Repository entitled Differentiated Thyroid Cancer Recurrence from UC Irvine Machine Learning Repository
- Research Paper entitled Machine Learning for Risk Stratification of Thyroid Cancer Patients: a 15-year Cohort Study from the European Archives of Oto-Rhino-Laryngology
This study hypothesized that various clinicopathological characteristics influence differentiated thyroid cancer recurrence among patients.
The dichotomous categorical target variable for the study is:
- Recurred - Status of the patient (Yes, Recurrence of differentiated thyroid cancer | No, No recurrence of differentiated thyroid cancer)
The predictor variables for the study are:
- Age - Patient's age (Years)
- Gender - Patient's sex (M | F)
- Smoking - Indication of smoking (Yes | No)
- Hx Smoking - Indication of smoking history (Yes | No)
- Hx Radiotherapy - Indication of radiotherapy history for any condition (Yes | No)
- Thyroid Function - Status of thyroid function (Clinical Hyperthyroidism, Hypothyroidism | Subclinical Hyperthyroidism, Hypothyroidism | Euthyroid)
- Physical Examination - Findings from physical examination including palpation of the thyroid gland and surrounding structures (Normal | Diffuse Goiter | Multinodular Goiter | Single Nodular Goiter Left, Right)
- Adenopathy - Indication of enlarged lymph nodes in the neck region (No | Right | Extensive | Left | Bilateral | Posterior)
- Pathology - Specific thyroid cancer type as determined by pathology examination of biopsy samples (Follicular | Hurthle Cell | Micropapillary | Papillary)
- Focality - Indication if the cancer is limited to one location or present in multiple locations (Uni-Focal | Multi-Focal)
- Risk - Risk category of the cancer based on various factors, such as tumor size, extent of spread, and histological type (Low | Intermediate | High)
- T - Tumor classification based on its size and extent of invasion into nearby structures (T1a | T1b | T2 | T3a | T3b | T4a | T4b)
- N - Nodal classification indicating the involvement of lymph nodes (N0 | N1a | N1b)
- M - Metastasis classification indicating the presence or absence of distant metastases (M0 | M1)
- Stage - Overall stage of the cancer, typically determined by combining T, N, and M classifications (I | II | III | IVA | IVB)
- Response - Cancer's response to treatment (Biochemical Incomplete | Indeterminate | Excellent | Structural Incomplete)
1.2. Data Description
- The initial tabular dataset comprised 383 observations and 17 variables (including 1 target and 16 predictors).
    - 383 rows (observations)
    - 17 columns (variables)
        - 1/17 target (categorical)
            - Recurred
        - 1/17 predictor (numeric)
            - Age
        - 15/17 predictor (categorical)
            - Gender
            - Smoking
            - Hx_Smoking
            - Hx_Radiotherapy
            - Thyroid_Function
            - Physical_Examination
            - Adenopathy
            - Pathology
            - Focality
            - Risk
            - T
            - N
            - M
            - Stage
            - Response
In [75]:
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
import itertools
import os
import pickle
%matplotlib inline
from operator import add,mul,truediv
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from scipy import stats
from scipy.stats import pointbiserialr, chi2_contingency
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold, KFold, cross_val_score
from sklearn.inspection import permutation_importance
In [76]:
##################################
# Defining file paths
##################################
DATASETS_ORIGINAL_PATH = r"datasets\original"
DATASETS_FINAL_PATH = r"datasets\final\complete"
DATASETS_FINAL_TRAIN_PATH = r"datasets\final\train"
DATASETS_FINAL_TRAIN_FEATURES_PATH = r"datasets\final\train\features"
DATASETS_FINAL_TRAIN_TARGET_PATH = r"datasets\final\train\target"
DATASETS_FINAL_VALIDATION_PATH = r"datasets\final\validation"
DATASETS_FINAL_VALIDATION_FEATURES_PATH = r"datasets\final\validation\features"
DATASETS_FINAL_VALIDATION_TARGET_PATH = r"datasets\final\validation\target"
DATASETS_FINAL_TEST_PATH = r"datasets\final\test"
DATASETS_FINAL_TEST_FEATURES_PATH = r"datasets\final\test\features"
DATASETS_FINAL_TEST_TARGET_PATH = r"datasets\final\test\target"
DATASETS_PREPROCESSED_PATH = r"datasets\preprocessed"
DATASETS_PREPROCESSED_TRAIN_PATH = r"datasets\preprocessed\train"
DATASETS_PREPROCESSED_TRAIN_FEATURES_PATH = r"datasets\preprocessed\train\features"
DATASETS_PREPROCESSED_TRAIN_TARGET_PATH = r"datasets\preprocessed\train\target"
DATASETS_PREPROCESSED_VALIDATION_PATH = r"datasets\preprocessed\validation"
DATASETS_PREPROCESSED_VALIDATION_FEATURES_PATH = r"datasets\preprocessed\validation\features"
DATASETS_PREPROCESSED_VALIDATION_TARGET_PATH = r"datasets\preprocessed\validation\target"
DATASETS_PREPROCESSED_TEST_PATH = r"datasets\preprocessed\test"
DATASETS_PREPROCESSED_TEST_FEATURES_PATH = r"datasets\preprocessed\test\features"
DATASETS_PREPROCESSED_TEST_TARGET_PATH = r"datasets\preprocessed\test\target"
MODELS_PATH = r"models"
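The raw-string paths above use backslash separators, so they resolve to nested directories on Windows only. A minimal sketch of an OS-independent alternative using `pathlib` is shown below. The names `PROJECT_ROOT`, `DATASETS_ORIGINAL_DIR`, `DATASETS_FINAL_TRAIN_FEATURES_DIR`, `MODELS_DIR` and `thyroid_csv_path` are illustrative assumptions and are not used elsewhere in the notebook.

```python
from pathlib import Path

# Hypothetical project root: the notebook loads its data relative to ".."
PROJECT_ROOT = Path("..")

# Building directories from components lets the operating system
# choose the separator instead of hard-coding a backslash
DATASETS_ORIGINAL_DIR = PROJECT_ROOT / "datasets" / "original"
DATASETS_FINAL_TRAIN_FEATURES_DIR = PROJECT_ROOT / "datasets" / "final" / "train" / "features"
MODELS_DIR = PROJECT_ROOT / "models"

# Path to the source file named in the loading cell
thyroid_csv_path = DATASETS_ORIGINAL_DIR / "Thyroid_Diff.csv"
print(thyroid_csv_path.as_posix())
```

A `Path` object can be passed directly to `pd.read_csv`, so the loading step would not need `os.path.join` under this layout.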
In [77]:
##################################
# Loading the dataset
# from the DATASETS_ORIGINAL_PATH
##################################
thyroid_cancer = pd.read_csv(os.path.join("..", DATASETS_ORIGINAL_PATH, "Thyroid_Diff.csv"))
In [78]:
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(thyroid_cancer.shape)
Dataset Dimensions:
(383, 17)
In [79]:
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(thyroid_cancer.dtypes)
Column Names and Data Types:
Age                      int64
Gender                  object
Smoking                 object
Hx Smoking              object
Hx Radiotherapy         object
Thyroid Function        object
Physical Examination    object
Adenopathy              object
Pathology               object
Focality                object
Risk                    object
T                       object
N                       object
M                       object
Stage                   object
Response                object
Recurred                object
dtype: object
In [80]:
##################################
# Renaming and standardizing the column names
# to replace blanks with underscores
##################################
thyroid_cancer.columns = thyroid_cancer.columns.str.replace(" ", "_")
In [81]:
##################################
# Taking a snapshot of the dataset
##################################
thyroid_cancer.head()
Out[81]:
| | Age | Gender | Smoking | Hx_Smoking | Hx_Radiotherapy | Thyroid_Function | Physical_Examination | Adenopathy | Pathology | Focality | Risk | T | N | M | Stage | Response | Recurred |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27 | F | No | No | No | Euthyroid | Single nodular goiter-left | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Indeterminate | No |
1 | 34 | F | No | Yes | No | Euthyroid | Multinodular goiter | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
2 | 30 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
3 | 62 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
4 | 62 | F | No | No | No | Euthyroid | Multinodular goiter | No | Micropapillary | Multi-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
In [82]:
##################################
# Selecting categorical columns (both object and categorical types)
# and listing the unique categorical levels
##################################
cat_cols = thyroid_cancer.select_dtypes(include=["object", "category"]).columns
for col in cat_cols:
    print(f"Categorical | Object Column: {col}")
    print(thyroid_cancer[col].unique())
    print("-" * 40)
Categorical | Object Column: Gender
['F' 'M']
----------------------------------------
Categorical | Object Column: Smoking
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Hx_Smoking
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Hx_Radiotherapy
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Thyroid_Function
['Euthyroid' 'Clinical Hyperthyroidism' 'Clinical Hypothyroidism' 'Subclinical Hyperthyroidism' 'Subclinical Hypothyroidism']
----------------------------------------
Categorical | Object Column: Physical_Examination
['Single nodular goiter-left' 'Multinodular goiter' 'Single nodular goiter-right' 'Normal' 'Diffuse goiter']
----------------------------------------
Categorical | Object Column: Adenopathy
['No' 'Right' 'Extensive' 'Left' 'Bilateral' 'Posterior']
----------------------------------------
Categorical | Object Column: Pathology
['Micropapillary' 'Papillary' 'Follicular' 'Hurthel cell']
----------------------------------------
Categorical | Object Column: Focality
['Uni-Focal' 'Multi-Focal']
----------------------------------------
Categorical | Object Column: Risk
['Low' 'Intermediate' 'High']
----------------------------------------
Categorical | Object Column: T
['T1a' 'T1b' 'T2' 'T3a' 'T3b' 'T4a' 'T4b']
----------------------------------------
Categorical | Object Column: N
['N0' 'N1b' 'N1a']
----------------------------------------
Categorical | Object Column: M
['M0' 'M1']
----------------------------------------
Categorical | Object Column: Stage
['I' 'II' 'IVB' 'III' 'IVA']
----------------------------------------
Categorical | Object Column: Response
['Indeterminate' 'Excellent' 'Structural Incomplete' 'Biochemical Incomplete']
----------------------------------------
Categorical | Object Column: Recurred
['No' 'Yes']
----------------------------------------
In [83]:
##################################
# Correcting a category level
##################################
thyroid_cancer["Pathology"] = thyroid_cancer["Pathology"].replace("Hurthel cell", "Hurthle Cell")
In [84]:
##################################
# Setting the levels of the categorical variables
##################################
thyroid_cancer['Recurred'] = thyroid_cancer['Recurred'].astype('category')
thyroid_cancer['Recurred'] = thyroid_cancer['Recurred'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Gender'] = thyroid_cancer['Gender'].astype('category')
thyroid_cancer['Gender'] = thyroid_cancer['Gender'].cat.set_categories(['M', 'F'], ordered=True)
thyroid_cancer['Smoking'] = thyroid_cancer['Smoking'].astype('category')
thyroid_cancer['Smoking'] = thyroid_cancer['Smoking'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Hx_Smoking'] = thyroid_cancer['Hx_Smoking'].astype('category')
thyroid_cancer['Hx_Smoking'] = thyroid_cancer['Hx_Smoking'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Hx_Radiotherapy'] = thyroid_cancer['Hx_Radiotherapy'].astype('category')
thyroid_cancer['Hx_Radiotherapy'] = thyroid_cancer['Hx_Radiotherapy'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Thyroid_Function'] = thyroid_cancer['Thyroid_Function'].astype('category')
thyroid_cancer['Thyroid_Function'] = thyroid_cancer['Thyroid_Function'].cat.set_categories(['Euthyroid', 'Subclinical Hypothyroidism', 'Subclinical Hyperthyroidism', 'Clinical Hypothyroidism', 'Clinical Hyperthyroidism'], ordered=True)
thyroid_cancer['Physical_Examination'] = thyroid_cancer['Physical_Examination'].astype('category')
thyroid_cancer['Physical_Examination'] = thyroid_cancer['Physical_Examination'].cat.set_categories(['Normal', 'Single nodular goiter-left', 'Single nodular goiter-right', 'Multinodular goiter', 'Diffuse goiter'], ordered=True)
thyroid_cancer['Adenopathy'] = thyroid_cancer['Adenopathy'].astype('category')
thyroid_cancer['Adenopathy'] = thyroid_cancer['Adenopathy'].cat.set_categories(['No', 'Left', 'Right', 'Bilateral', 'Posterior', 'Extensive'], ordered=True)
thyroid_cancer['Pathology'] = thyroid_cancer['Pathology'].astype('category')
thyroid_cancer['Pathology'] = thyroid_cancer['Pathology'].cat.set_categories(['Hurthle Cell', 'Follicular', 'Micropapillary', 'Papillary'], ordered=True)
thyroid_cancer['Focality'] = thyroid_cancer['Focality'].astype('category')
thyroid_cancer['Focality'] = thyroid_cancer['Focality'].cat.set_categories(['Uni-Focal', 'Multi-Focal'], ordered=True)
thyroid_cancer['Risk'] = thyroid_cancer['Risk'].astype('category')
thyroid_cancer['Risk'] = thyroid_cancer['Risk'].cat.set_categories(['Low', 'Intermediate', 'High'], ordered=True)
thyroid_cancer['T'] = thyroid_cancer['T'].astype('category')
thyroid_cancer['T'] = thyroid_cancer['T'].cat.set_categories(['T1a', 'T1b', 'T2', 'T3a', 'T3b', 'T4a', 'T4b'], ordered=True)
thyroid_cancer['N'] = thyroid_cancer['N'].astype('category')
thyroid_cancer['N'] = thyroid_cancer['N'].cat.set_categories(['N0', 'N1a', 'N1b'], ordered=True)
thyroid_cancer['M'] = thyroid_cancer['M'].astype('category')
thyroid_cancer['M'] = thyroid_cancer['M'].cat.set_categories(['M0', 'M1'], ordered=True)
thyroid_cancer['Stage'] = thyroid_cancer['Stage'].astype('category')
thyroid_cancer['Stage'] = thyroid_cancer['Stage'].cat.set_categories(['I', 'II', 'III', 'IVA', 'IVB'], ordered=True)
thyroid_cancer['Response'] = thyroid_cancer['Response'].astype('category')
thyroid_cancer['Response'] = thyroid_cancer['Response'].cat.set_categories(['Excellent', 'Structural Incomplete', 'Biochemical Incomplete', 'Indeterminate'], ordered=True)
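The cell above repeats the same two-step pattern for every column. A more compact, dictionary-driven variant might look like the sketch below. It runs on a small toy frame rather than on `thyroid_cancer`, and the names `toy` and `category_levels` are hypothetical. Only two of the level lists from the cell above are reproduced.

```python
import pandas as pd

# Toy frame standing in for thyroid_cancer (values are illustrative only)
toy = pd.DataFrame({
    "Risk": ["Low", "High", "Intermediate", "Low"],
    "N": ["N0", "N1b", "N1a", "N0"],
})

# Hypothetical mapping of column name -> ordered levels,
# using the same level order as the cell above
category_levels = {
    "Risk": ["Low", "Intermediate", "High"],
    "N": ["N0", "N1a", "N1b"],
}

# Converting each column to an ordered categorical in one pass
for col, levels in category_levels.items():
    toy[col] = pd.Categorical(toy[col], categories=levels, ordered=True)

print(toy["Risk"].cat.categories.tolist())
print(toy["Risk"].max())
```

With either approach, any value that is not listed among the categories becomes missing, which is why the `Hurthel cell` spelling had to be corrected before the levels were set.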
In [85]:
##################################
# Performing a general exploration of the numeric variables
##################################
print('Numeric Variable Summary:')
display(thyroid_cancer.describe(include='number').transpose())
Numeric Variable Summary:
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Age | 383.0 | 40.866841 | 15.134494 | 15.0 | 29.0 | 37.0 | 51.0 | 82.0 |
In [86]:
##################################
# Performing a general exploration of the categorical variables
##################################
print('Categorical Variable Summary:')
display(thyroid_cancer.describe(include='category').transpose())
Categorical Variable Summary:
| | count | unique | top | freq |
---|---|---|---|---|
Gender | 383 | 2 | F | 312 |
Smoking | 383 | 2 | No | 334 |
Hx_Smoking | 383 | 2 | No | 355 |
Hx_Radiotherapy | 383 | 2 | No | 376 |
Thyroid_Function | 383 | 5 | Euthyroid | 332 |
Physical_Examination | 383 | 5 | Single nodular goiter-right | 140 |
Adenopathy | 383 | 6 | No | 277 |
Pathology | 383 | 4 | Papillary | 287 |
Focality | 383 | 2 | Uni-Focal | 247 |
Risk | 383 | 3 | Low | 249 |
T | 383 | 7 | T2 | 151 |
N | 383 | 3 | N0 | 268 |
M | 383 | 2 | M0 | 365 |
Stage | 383 | 5 | I | 333 |
Response | 383 | 4 | Excellent | 208 |
Recurred | 383 | 2 | No | 275 |
In [87]:
##################################
# Performing a general exploration of the categorical variable levels
# based on the ordered categories
##################################
ordered_cat_cols = thyroid_cancer.select_dtypes(include=["category"]).columns
for col in ordered_cat_cols:
    print(f"Column: {col}")
    print("Absolute Frequencies:")
    print(thyroid_cancer[col].value_counts().reindex(thyroid_cancer[col].cat.categories))
    print("\nNormalized Frequencies:")
    print(thyroid_cancer[col].value_counts(normalize=True).reindex(thyroid_cancer[col].cat.categories))
    print("-" * 50)
Column: Gender
Absolute Frequencies:
M     71
F    312
Name: count, dtype: int64

Normalized Frequencies:
M    0.185379
F    0.814621
Name: proportion, dtype: float64
--------------------------------------------------
Column: Smoking
Absolute Frequencies:
No     334
Yes     49
Name: count, dtype: int64

Normalized Frequencies:
No     0.872063
Yes    0.127937
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Smoking
Absolute Frequencies:
No     355
Yes     28
Name: count, dtype: int64

Normalized Frequencies:
No     0.926893
Yes    0.073107
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Radiotherapy
Absolute Frequencies:
No     376
Yes      7
Name: count, dtype: int64

Normalized Frequencies:
No     0.981723
Yes    0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Thyroid_Function
Absolute Frequencies:
Euthyroid                      332
Subclinical Hypothyroidism      14
Subclinical Hyperthyroidism      5
Clinical Hypothyroidism         12
Clinical Hyperthyroidism        20
Name: count, dtype: int64

Normalized Frequencies:
Euthyroid                      0.866841
Subclinical Hypothyroidism     0.036554
Subclinical Hyperthyroidism    0.013055
Clinical Hypothyroidism        0.031332
Clinical Hyperthyroidism       0.052219
Name: proportion, dtype: float64
--------------------------------------------------
Column: Physical_Examination
Absolute Frequencies:
Normal                           7
Single nodular goiter-left      89
Single nodular goiter-right    140
Multinodular goiter            140
Diffuse goiter                   7
Name: count, dtype: int64

Normalized Frequencies:
Normal                         0.018277
Single nodular goiter-left     0.232376
Single nodular goiter-right    0.365535
Multinodular goiter            0.365535
Diffuse goiter                 0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Adenopathy
Absolute Frequencies:
No           277
Left          17
Right         48
Bilateral     32
Posterior      2
Extensive      7
Name: count, dtype: int64

Normalized Frequencies:
No           0.723238
Left         0.044386
Right        0.125326
Bilateral    0.083551
Posterior    0.005222
Extensive    0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Pathology
Absolute Frequencies:
Hurthle Cell       20
Follicular         28
Micropapillary     48
Papillary         287
Name: count, dtype: int64

Normalized Frequencies:
Hurthle Cell      0.052219
Follicular        0.073107
Micropapillary    0.125326
Papillary         0.749347
Name: proportion, dtype: float64
--------------------------------------------------
Column: Focality
Absolute Frequencies:
Uni-Focal      247
Multi-Focal    136
Name: count, dtype: int64

Normalized Frequencies:
Uni-Focal      0.644909
Multi-Focal    0.355091
Name: proportion, dtype: float64
--------------------------------------------------
Column: Risk
Absolute Frequencies:
Low             249
Intermediate    102
High             32
Name: count, dtype: int64

Normalized Frequencies:
Low             0.650131
Intermediate    0.266319
High            0.083551
Name: proportion, dtype: float64
--------------------------------------------------
Column: T
Absolute Frequencies:
T1a     49
T1b     43
T2     151
T3a     96
T3b     16
T4a     20
T4b      8
Name: count, dtype: int64

Normalized Frequencies:
T1a    0.127937
T1b    0.112272
T2     0.394256
T3a    0.250653
T3b    0.041775
T4a    0.052219
T4b    0.020888
Name: proportion, dtype: float64
--------------------------------------------------
Column: N
Absolute Frequencies:
N0     268
N1a     22
N1b     93
Name: count, dtype: int64

Normalized Frequencies:
N0     0.699739
N1a    0.057441
N1b    0.242820
Name: proportion, dtype: float64
--------------------------------------------------
Column: M
Absolute Frequencies:
M0    365
M1     18
Name: count, dtype: int64

Normalized Frequencies:
M0    0.953003
M1    0.046997
Name: proportion, dtype: float64
--------------------------------------------------
Column: Stage
Absolute Frequencies:
I      333
II      32
III      4
IVA      3
IVB     11
Name: count, dtype: int64

Normalized Frequencies:
I      0.869452
II     0.083551
III    0.010444
IVA    0.007833
IVB    0.028721
Name: proportion, dtype: float64
--------------------------------------------------
Column: Response
Absolute Frequencies:
Excellent                 208
Structural Incomplete      91
Biochemical Incomplete     23
Indeterminate              61
Name: count, dtype: int64

Normalized Frequencies:
Excellent                 0.543081
Structural Incomplete     0.237598
Biochemical Incomplete    0.060052
Indeterminate             0.159269
Name: proportion, dtype: float64
--------------------------------------------------
Column: Recurred
Absolute Frequencies:
No     275
Yes    108
Name: count, dtype: int64

Normalized Frequencies:
No     0.718016
Yes    0.281984
Name: proportion, dtype: float64
--------------------------------------------------
1.3. Data Quality Assessment
Data quality findings based on assessment are as follows:
- A total of 19 duplicated rows were identified.
- In total, 35 observations were affected, consisting of 16 unique occurrences and 19 subsequent duplicates.
- These 19 duplicates spanned 16 distinct variations, meaning some variations had multiple duplicates.
- To clean the dataset, all 19 duplicate rows were removed, retaining only the first occurrence of each of the 16 unique variations.
- No missing data noted: no variable had Null.Count>0 or Fill.Rate<1.0.
- Low variance observed for 8 variables with First.Second.Mode.Ratio>5.
- Hx_Radiotherapy: First.Second.Mode.Ratio = 51.000 (comprised 2 category levels)
- M: First.Second.Mode.Ratio = 19.222 (comprised 2 category levels)
- Thyroid_Function: First.Second.Mode.Ratio = 15.650 (comprised 5 category levels)
- Hx_Smoking: First.Second.Mode.Ratio = 12.000 (comprised 2 category levels)
- Stage: First.Second.Mode.Ratio = 9.812 (comprised 5 category levels)
- Smoking: First.Second.Mode.Ratio = 6.428 (comprised 2 category levels)
- Pathology: First.Second.Mode.Ratio = 6.022 (comprised 4 category levels)
- Adenopathy: First.Second.Mode.Ratio = 5.375 (comprised 6 category levels)
- No low variance observed for any variable with Unique.Count.Ratio>10.
- No high skewness observed for any variable with Skewness>3 or Skewness<(-3).
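The reported First.Second.Mode.Ratio values are consistent with dividing the count of the most frequent level by the count of the second most frequent level, computed after the 19 duplicates were removed. A minimal sketch under that assumption follows. The helper name `first_second_mode_ratio` is hypothetical, and the 357 and 346 counts are derived here by subtracting the 19 duplicates (all `No` and `M0`) from the original 376 and 365.

```python
import pandas as pd

def first_second_mode_ratio(series: pd.Series) -> float:
    """Ratio of the most frequent level count to the second most frequent."""
    counts = series.value_counts()  # sorted in descending order by default
    # A ratio is undefined with fewer than two observed levels
    if len(counts) < 2 or counts.iloc[1] == 0:
        return float("nan")
    return counts.iloc[0] / counts.iloc[1]

# Hx_Radiotherapy after deduplication: 376 - 19 = 357 "No" versus 7 "Yes"
hx_radiotherapy = pd.Series(["No"] * 357 + ["Yes"] * 7)
print(round(first_second_mode_ratio(hx_radiotherapy), 3))

# M after deduplication: 365 - 19 = 346 "M0" versus 18 "M1"
m_stage = pd.Series(["M0"] * 346 + ["M1"] * 18)
print(round(first_second_mode_ratio(m_stage), 3))
```

Under this reading, a ratio above 5 flags a variable whose dominant level occurs more than five times as often as the runner-up.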
In [88]:
##################################
# Counting the number of duplicated rows
##################################
thyroid_cancer.duplicated().sum()
Out[88]:
np.int64(19)
In [89]:
##################################
# Exploring the duplicated rows
##################################
duplicated_rows = thyroid_cancer[thyroid_cancer.duplicated(keep=False)]
display(duplicated_rows)
| | Age | Gender | Smoking | Hx_Smoking | Hx_Radiotherapy | Thyroid_Function | Physical_Examination | Adenopathy | Pathology | Focality | Risk | T | N | M | Stage | Response | Recurred |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
9 | 40 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
22 | 36 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
32 | 36 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
38 | 40 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
40 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No |
61 | 35 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
66 | 35 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
67 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-left | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
69 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-left | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
73 | 29 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
77 | 29 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No |
106 | 26 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
110 | 31 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
113 | 32 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
115 | 37 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
119 | 28 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
120 | 37 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
121 | 26 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
123 | 28 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
132 | 32 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
136 | 21 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
137 | 32 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
138 | 26 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
142 | 42 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
161 | 22 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
166 | 31 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
168 | 21 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
170 | 38 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
175 | 34 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
178 | 38 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
183 | 26 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
187 | 34 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
189 | 42 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
196 | 22 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No |
In [90]:
##################################
# Checking if duplicated rows have identical values across all columns
##################################
num_unique_dup_rows = duplicated_rows.drop_duplicates().shape[0]
num_total_dup_rows = duplicated_rows.shape[0]
if num_unique_dup_rows == 1:
    print("All duplicated rows have the same values across all columns.")
else:
    print(f"There are {num_unique_dup_rows} unique versions among the {num_total_dup_rows} duplicated rows.")
There are 16 unique versions among the 35 duplicated rows.
In [91]:
##################################
# Counting the unique variations among duplicated rows
##################################
unique_dup_variations = duplicated_rows.drop_duplicates()
variation_counts = duplicated_rows.value_counts().reset_index(name="Count")
print("Unique duplicated row variations and their counts:")
display(variation_counts)
Unique duplicated row variations and their counts:
| | Age | Gender | Smoking | Hx_Smoking | Hx_Radiotherapy | Thyroid_Function | Physical_Examination | Adenopathy | Pathology | Focality | Risk | T | N | M | Stage | Response | Recurred | Count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 4 |
1 | 32 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 3 |
2 | 22 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
3 | 21 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
4 | 28 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
5 | 29 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No | 2 |
6 | 31 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
7 | 34 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
8 | 35 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No | 2 |
9 | 36 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No | 2 |
10 | 37 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
11 | 38 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
12 | 40 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No | 2 |
13 | 42 | F | No | No | No | Euthyroid | Multinodular goiter | No | Papillary | Uni-Focal | Low | T2 | N0 | M0 | I | Excellent | No | 2 |
14 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-left | No | Papillary | Uni-Focal | Low | T1b | N0 | M0 | I | Excellent | No | 2 |
15 | 51 | F | No | No | No | Euthyroid | Single nodular goiter-right | No | Micropapillary | Uni-Focal | Low | T1a | N0 | M0 | I | Excellent | No | 2 |
In [92]:
##################################
# Removing the duplicated rows and
# retaining only the first occurrence
##################################
thyroid_cancer_row_filtered = thyroid_cancer.drop_duplicates(keep="first")
print('Dataset Dimensions: ')
display(thyroid_cancer_row_filtered.shape)
Dataset Dimensions:
(364, 17)
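As a minimal sketch of what `keep="first"` does, the toy frame below (hypothetical values, not taken from the dataset) contains one exact duplicate row. Only `drop_duplicates` and `duplicated` are relied upon; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are exact duplicates
toy = pd.DataFrame({"Age": [21, 28, 21], "Gender": ["F", "F", "F"]})

# keep="first" retains the first occurrence of each duplicated row
# and leaves the original index labels untouched
toy_filtered = toy.drop_duplicates(keep="first")

# Number of rows flagged as repeats of an earlier row
n_duplicates = int(toy.duplicated(keep="first").sum())

print(toy_filtered.index.tolist(), n_duplicates)
```

Because the original index labels are preserved, a filtered frame can have fewer rows than its largest index label suggests, which is why the row-level summary further down shows Row.Name values up to 382 for only 364 rows.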
In [93]:
##################################
# Gathering the data types for each column
##################################
data_type_list = list(thyroid_cancer_row_filtered.dtypes)
In [94]:
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(thyroid_cancer_row_filtered.columns)
In [95]:
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(thyroid_cancer_row_filtered)] * len(thyroid_cancer_row_filtered.columns))
In [96]:
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(thyroid_cancer_row_filtered.isna().sum(axis=0))
In [97]:
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(thyroid_cancer_row_filtered.count())
In [98]:
##################################
# Gathering the fill rate (proportion of non-missing data) for each column
##################################
fill_rate_list = map(truediv, non_null_count_list, row_count_list)
In [99]:
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
data_type_list,
row_count_list,
non_null_count_list,
null_count_list,
fill_rate_list),
columns=['Column.Name',
'Column.Type',
'Row.Count',
'Non.Null.Count',
'Null.Count',
'Fill.Rate'])
display(all_column_quality_summary)
Index | Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate |
---|---|---|---|---|---|---|
0 | Age | int64 | 364 | 364 | 0 | 1.0 |
1 | Gender | category | 364 | 364 | 0 | 1.0 |
2 | Smoking | category | 364 | 364 | 0 | 1.0 |
3 | Hx_Smoking | category | 364 | 364 | 0 | 1.0 |
4 | Hx_Radiotherapy | category | 364 | 364 | 0 | 1.0 |
5 | Thyroid_Function | category | 364 | 364 | 0 | 1.0 |
6 | Physical_Examination | category | 364 | 364 | 0 | 1.0 |
7 | Adenopathy | category | 364 | 364 | 0 | 1.0 |
8 | Pathology | category | 364 | 364 | 0 | 1.0 |
9 | Focality | category | 364 | 364 | 0 | 1.0 |
10 | Risk | category | 364 | 364 | 0 | 1.0 |
11 | T | category | 364 | 364 | 0 | 1.0 |
12 | N | category | 364 | 364 | 0 | 1.0 |
13 | M | category | 364 | 364 | 0 | 1.0 |
14 | Stage | category | 364 | 364 | 0 | 1.0 |
15 | Response | category | 364 | 364 | 0 | 1.0 |
16 | Recurred | category | 364 | 364 | 0 | 1.0 |
In [100]:
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])
Out[100]:
0
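The same column-level summary can also be assembled directly from pandas objects, without intermediate lists. The sketch below uses a hypothetical toy frame with one missing value; the column labels mirror the notebook's, but the construction itself is an alternative, not the notebook's method.

```python
import pandas as pd

# Hypothetical toy frame with one missing Age value
toy = pd.DataFrame({"Age": [21, 28, None, 40], "Gender": ["F", "M", "F", "F"]})

# Each Series is indexed by column name; the scalar row count is broadcast
column_quality = pd.DataFrame({
    "Column.Type": toy.dtypes.astype(str),
    "Row.Count": len(toy),
    "Non.Null.Count": toy.count(),
    "Null.Count": toy.isna().sum(),
})

# Fill rate = non-missing observations / total observations
column_quality["Fill.Rate"] = column_quality["Non.Null.Count"] / column_quality["Row.Count"]

print(column_quality)
```

One practical difference: `map(truediv, ...)` returns a one-shot iterator that is exhausted once `zip` consumes it, whereas the vectorized division above yields a reusable Series.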
In [101]:
##################################
# Identifying the columns
# with Fill.Rate < 0.90
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<0.90)]
In [102]:
##################################
# Gathering the indices for each observation
##################################
row_index_list = thyroid_cancer_row_filtered.index
In [103]:
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(thyroid_cancer_row_filtered.columns)] * len(thyroid_cancer_row_filtered))
In [104]:
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(thyroid_cancer_row_filtered.isna().sum(axis=1))
In [105]:
##################################
# Gathering the missing data rate for each row
##################################
missing_rate_list = map(truediv, null_row_list, column_count_list)
In [106]:
##################################
# Formulating the summary
# for all rows
##################################
all_row_quality_summary = pd.DataFrame(zip(row_index_list,
column_count_list,
null_row_list,
missing_rate_list),
columns=['Row.Name',
'Column.Count',
'Null.Count',
'Missing.Rate'])
display(all_row_quality_summary)
Index | Row.Name | Column.Count | Null.Count | Missing.Rate |
---|---|---|---|---|
0 | 0 | 17 | 0 | 0.0 |
1 | 1 | 17 | 0 | 0.0 |
2 | 2 | 17 | 0 | 0.0 |
3 | 3 | 17 | 0 | 0.0 |
4 | 4 | 17 | 0 | 0.0 |
... | ... | ... | ... | ... |
359 | 378 | 17 | 0 | 0.0 |
360 | 379 | 17 | 0 | 0.0 |
361 | 380 | 17 | 0 | 0.0 |
362 | 381 | 17 | 0 | 0.0 |
363 | 382 | 17 | 0 | 0.0 |
364 rows × 4 columns
In [107]:
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])
Out[107]:
0
In [108]:
##################################
# Formulating the dataset
# with numeric columns only
##################################
thyroid_cancer_numeric = thyroid_cancer_row_filtered.select_dtypes(include='number')
In [109]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = thyroid_cancer_numeric.columns
In [110]:
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = thyroid_cancer_numeric.min()
In [111]:
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = thyroid_cancer_numeric.mean()
In [112]:
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = thyroid_cancer_numeric.median()
In [113]:
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = thyroid_cancer_numeric.max()
In [114]:
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0] for x in thyroid_cancer_numeric]
In [115]:
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1] for x in thyroid_cancer_numeric]
In [116]:
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [thyroid_cancer_numeric[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in thyroid_cancer_numeric]
In [117]:
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [thyroid_cancer_numeric[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in thyroid_cancer_numeric]
In [118]:
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
In [119]:
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = thyroid_cancer_numeric.nunique(dropna=True)
In [120]:
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(thyroid_cancer_numeric)] * len(thyroid_cancer_numeric.columns))
In [121]:
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
In [122]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = thyroid_cancer_numeric.skew()
In [123]:
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = thyroid_cancer_numeric.kurtosis()
In [124]:
##################################
# Generating a column quality summary for the numeric column
##################################
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
numeric_minimum_list,
numeric_mean_list,
numeric_median_list,
numeric_maximum_list,
numeric_first_mode_list,
numeric_second_mode_list,
numeric_first_mode_count_list,
numeric_second_mode_count_list,
numeric_first_second_mode_ratio_list,
numeric_unique_count_list,
numeric_row_count_list,
numeric_unique_count_ratio_list,
numeric_skewness_list,
numeric_kurtosis_list),
columns=['Numeric.Column.Name',
'Minimum',
'Mean',
'Median',
'Maximum',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio',
'Skewness',
'Kurtosis'])
display(numeric_column_quality_summary)
Index | Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Age | 15 | 41.25 | 38.0 | 82 | 31 | 27 | 21 | 13 | 1.615385 | 65 | 364 | 0.178571 | 0.678269 | -0.359255 |
In [125]:
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[125]:
0
In [126]:
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
# (the ratio is unique count / row count and cannot exceed 1.00,
# so this check always returns 0)
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
Out[126]:
0
In [127]:
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])
Out[127]:
0
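For context on the two shape statistics: pandas `skew()` returns the bias-corrected sample skewness, and `kurtosis()` returns the bias-corrected excess kurtosis under Fisher's definition (a normal distribution scores 0). A minimal sketch on two hypothetical series shows the sign conventions:

```python
import pandas as pd

# Hypothetical series: one symmetric, one with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 2, 2, 3, 10])

# Symmetric data has skewness of (numerically) zero;
# a long right tail gives positive skewness
symmetric_skew = symmetric.skew()
right_tailed_skew = right_tailed.skew()

# Flat, evenly spread data has negative excess kurtosis
symmetric_kurtosis = symmetric.kurtosis()

print(symmetric_skew, right_tailed_skew, symmetric_kurtosis)
```

Read this way, the Age column's skewness of 0.678 indicates a moderate right tail (consistent with its mean of 41.25 exceeding its median of 38.0), and its kurtosis of -0.359 indicates slightly lighter tails than a normal distribution.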
In [128]:
##################################
# Formulating the dataset
# with categorical columns only
##################################
thyroid_cancer_categorical = thyroid_cancer_row_filtered.select_dtypes(include='category')
In [129]:
##################################
# Gathering the variable names for the categorical column
##################################
categorical_variable_name_list = thyroid_cancer_categorical.columns
In [130]:
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [thyroid_cancer_row_filtered[x].value_counts().index.tolist()[0] for x in thyroid_cancer_categorical]
In [131]:
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [thyroid_cancer_row_filtered[x].value_counts().index.tolist()[1] for x in thyroid_cancer_categorical]
In [132]:
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [thyroid_cancer_categorical[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in thyroid_cancer_categorical]
In [133]:
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [thyroid_cancer_categorical[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in thyroid_cancer_categorical]
In [134]:
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
In [135]:
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = thyroid_cancer_categorical.nunique(dropna=True)
In [136]:
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(thyroid_cancer_categorical)] * len(thyroid_cancer_categorical.columns))
In [137]:
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
In [138]:
##################################
# Generating a column quality summary for the categorical columns
##################################
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
categorical_first_mode_list,
categorical_second_mode_list,
categorical_first_mode_count_list,
categorical_second_mode_count_list,
categorical_first_second_mode_ratio_list,
categorical_unique_count_list,
categorical_row_count_list,
categorical_unique_count_ratio_list),
columns=['Categorical.Column.Name',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio'])
display(categorical_column_quality_summary)
Index | Categorical.Column.Name | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio |
---|---|---|---|---|---|---|---|---|---|
0 | Gender | F | M | 293 | 71 | 4.126761 | 2 | 364 | 0.005495 |
1 | Smoking | No | Yes | 315 | 49 | 6.428571 | 2 | 364 | 0.005495 |
2 | Hx_Smoking | No | Yes | 336 | 28 | 12.000000 | 2 | 364 | 0.005495 |
3 | Hx_Radiotherapy | No | Yes | 357 | 7 | 51.000000 | 2 | 364 | 0.005495 |
4 | Thyroid_Function | Euthyroid | Clinical Hyperthyroidism | 313 | 20 | 15.650000 | 5 | 364 | 0.013736 |
5 | Physical_Examination | Multinodular goiter | Single nodular goiter-right | 135 | 127 | 1.062992 | 5 | 364 | 0.013736 |
6 | Adenopathy | No | Right | 258 | 48 | 5.375000 | 6 | 364 | 0.016484 |
7 | Pathology | Papillary | Micropapillary | 271 | 45 | 6.022222 | 4 | 364 | 0.010989 |
8 | Focality | Uni-Focal | Multi-Focal | 228 | 136 | 1.676471 | 2 | 364 | 0.005495 |
9 | Risk | Low | Intermediate | 230 | 102 | 2.254902 | 3 | 364 | 0.008242 |
10 | T | T2 | T3a | 138 | 96 | 1.437500 | 7 | 364 | 0.019231 |
11 | N | N0 | N1b | 249 | 93 | 2.677419 | 3 | 364 | 0.008242 |
12 | M | M0 | M1 | 346 | 18 | 19.222222 | 2 | 364 | 0.005495 |
13 | Stage | I | II | 314 | 32 | 9.812500 | 5 | 364 | 0.013736 |
14 | Response | Excellent | Structural Incomplete | 189 | 91 | 2.076923 | 4 | 364 | 0.010989 |
15 | Recurred | No | Yes | 256 | 108 | 2.370370 | 2 | 364 | 0.005495 |
In [139]:
##################################
# Counting the number of categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[139]:
8
In [140]:
##################################
# Identifying the categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
display(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)].sort_values(by=['First.Second.Mode.Ratio'], ascending=False))
Index | Categorical.Column.Name | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio |
---|---|---|---|---|---|---|---|---|---|
3 | Hx_Radiotherapy | No | Yes | 357 | 7 | 51.000000 | 2 | 364 | 0.005495 |
12 | M | M0 | M1 | 346 | 18 | 19.222222 | 2 | 364 | 0.005495 |
4 | Thyroid_Function | Euthyroid | Clinical Hyperthyroidism | 313 | 20 | 15.650000 | 5 | 364 | 0.013736 |
2 | Hx_Smoking | No | Yes | 336 | 28 | 12.000000 | 2 | 364 | 0.005495 |
13 | Stage | I | II | 314 | 32 | 9.812500 | 5 | 364 | 0.013736 |
1 | Smoking | No | Yes | 315 | 49 | 6.428571 | 2 | 364 | 0.005495 |
7 | Pathology | Papillary | Micropapillary | 271 | 45 | 6.022222 | 4 | 364 | 0.010989 |
6 | Adenopathy | No | Right | 258 | 48 | 5.375000 | 6 | 364 | 0.016484 |
In [141]:
##################################
# Counting the number of categorical columns
# with Unique.Count.Ratio > 10.00
# (the ratio is unique count / row count and cannot exceed 1.00,
# so this check always returns 0)
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
Out[141]:
0
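The First.Second.Mode.Ratio used above can be computed in one step from `value_counts`, which sorts categories by descending frequency. The helper below is a hypothetical convenience function, not part of the notebook; the toy series is rebuilt from the Hx_Radiotherapy counts reported in the summary table (357 No, 7 Yes).

```python
import pandas as pd


def first_second_mode_ratio(series: pd.Series) -> float:
    """Ratio of the most frequent value's count to the second most frequent's."""
    # value_counts sorts in descending order of frequency
    counts = series.value_counts(dropna=True)
    # Raises IndexError when the series has fewer than two distinct values
    return counts.iloc[0] / counts.iloc[1]


# Series reconstructed from the reported Hx_Radiotherapy counts
hx_radiotherapy = pd.Series(["No"] * 357 + ["Yes"] * 7)
ratio = first_second_mode_ratio(hx_radiotherapy)

print(ratio)
```

A ratio this high means the minority level appears in only 7 of 364 rows, so the column carries little variance for a model to learn from.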
1.4. Data Preprocessing
1.4.1 Data Splitting
- The baseline dataset (with duplicate rows removed from the original dataset) is comprised of:
  - 364 rows (observations)
    - 256 Recurred=No: 70.33%
    - 108 Recurred=Yes: 29.67%
  - 17 columns (variables)
    - 1/17 target (categorical)
      - Recurred
    - 1/17 predictor (numeric)
      - Age
    - 15/17 predictor (categorical)
      - Gender
      - Smoking
      - Hx_Smoking
      - Hx_Radiotherapy
      - Thyroid_Function
      - Physical_Examination
      - Adenopathy
      - Pathology
      - Focality
      - Risk
      - T
      - N
      - M
      - Stage
      - Response
- The baseline dataset was divided into three subsets using a fixed random seed:
  - test data: 25% of the baseline data with class stratification applied
  - train data (initial): 75% of the baseline data with class stratification applied
    - train data (final): 75% of the train (initial) data with class stratification applied
    - validation data: 25% of the train (initial) data with class stratification applied
- Models were developed from the train data (final). Using the same dataset, a subset of models with optimal hyperparameters was selected based on cross-validation.
- Among candidate models with optimal hyperparameters, the final model was selected based on performance on the validation data.
- Performance of the selected final model (and other candidate models for post-model selection comparison) was evaluated using the test data.
- The train data (final) subset is comprised of:
  - 204 rows (observations)
    - 143 Recurred=No: 70.10%
    - 61 Recurred=Yes: 29.90%
  - 17 columns (variables)
- The validation data subset is comprised of:
  - 69 rows (observations)
    - 49 Recurred=No: 71.01%
    - 20 Recurred=Yes: 28.99%
  - 17 columns (variables)
- The test data subset is comprised of:
  - 91 rows (observations)
    - 64 Recurred=No: 70.33%
    - 27 Recurred=Yes: 29.67%
  - 17 columns (variables)
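The two-stage split described above can be sketched on a synthetic stand-in with the same class counts as the baseline data (256 No, 108 Yes). The toy frame and its `Age` values are placeholders; only `train_test_split`, the 75-25 ratios, the stratification and the seed follow the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the baseline class counts (placeholder Age values)
toy = pd.DataFrame({
    "Age": np.arange(364),
    "Recurred": ["No"] * 256 + ["Yes"] * 108,
})

SEED = 987654321

# Stage 1: hold out 25% of the baseline data as the test set
train_initial, test = train_test_split(
    toy, test_size=0.25, stratify=toy["Recurred"], random_state=SEED
)

# Stage 2: hold out 25% of the initial train data as the validation set
train, validation = train_test_split(
    train_initial, test_size=0.25, stratify=train_initial["Recurred"], random_state=SEED
)

split_sizes = {"train": len(train), "validation": len(validation), "test": len(test)}
print(split_sizes)
```

Relative to the 364 baseline rows, the resulting subsets hold roughly 56% (train), 19% (validation) and 25% (test) of the observations, since the second 75-25 split applies only to the 273 rows left after the first.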
In [142]:
##################################
# Creating a dataset copy
# of the row filtered data
##################################
thyroid_cancer_baseline = thyroid_cancer_row_filtered.copy()
In [143]:
##################################
# Performing a general exploration
# of the baseline dataset
##################################
print('Final Dataset Dimensions: ')
display(thyroid_cancer_baseline.shape)
Final Dataset Dimensions:
(364, 17)
In [144]:
##################################
# Obtaining the distribution
# of the target variable
##################################
print('Target Variable Breakdown: ')
thyroid_cancer_breakdown = thyroid_cancer_baseline.groupby('Recurred', observed=True).size().reset_index(name='Count')
thyroid_cancer_breakdown['Percentage'] = (thyroid_cancer_breakdown['Count'] / len(thyroid_cancer_baseline)) * 100
display(thyroid_cancer_breakdown)
Target Variable Breakdown:
Index | Recurred | Count | Percentage |
---|---|---|---|
0 | No | 256 | 70.32967 |
1 | Yes | 108 | 29.67033 |
In [145]:
##################################
# Formulating the train and test data
# from the final dataset
# by applying stratification and
# using a 75-25 ratio
##################################
thyroid_cancer_train_initial, thyroid_cancer_test = train_test_split(thyroid_cancer_baseline,
test_size=0.25,
stratify=thyroid_cancer_baseline['Recurred'],
random_state=987654321)
In [146]:
##################################
# Performing a general exploration
# of the initial training dataset
##################################
X_train_initial = thyroid_cancer_train_initial.drop('Recurred', axis = 1)
y_train_initial = thyroid_cancer_train_initial['Recurred']
print('Initial Train Dataset Dimensions: ')
display(X_train_initial.shape)
display(y_train_initial.shape)
print('Initial Train Target Variable Breakdown: ')
display(y_train_initial.value_counts())
print('Initial Train Target Variable Proportion: ')
display(y_train_initial.value_counts(normalize = True))
Initial Train Dataset Dimensions:
(273, 16)
(273,)
Initial Train Target Variable Breakdown:
Recurred
No     192
Yes     81
Name: count, dtype: int64
Initial Train Target Variable Proportion:
Recurred
No     0.703297
Yes    0.296703
Name: proportion, dtype: float64
In [147]:
##################################
# Performing a general exploration
# of the test dataset
##################################
X_test = thyroid_cancer_test.drop('Recurred', axis = 1)
y_test = thyroid_cancer_test['Recurred']
print('Test Dataset Dimensions: ')
display(X_test.shape)
display(y_test.shape)
print('Test Target Variable Breakdown: ')
display(y_test.value_counts())
print('Test Target Variable Proportion: ')
display(y_test.value_counts(normalize = True))
Test Dataset Dimensions:
(91, 16)
(91,)
Test Target Variable Breakdown:
Recurred
No     64
Yes    27
Name: count, dtype: int64
Test Target Variable Proportion:
Recurred
No     0.703297
Yes    0.296703
Name: proportion, dtype: float64
In [148]:
##################################
# Formulating the train and validation data
# from the train dataset
# by applying stratification and
# using a 75-25 ratio
##################################
thyroid_cancer_train, thyroid_cancer_validation = train_test_split(thyroid_cancer_train_initial,
test_size=0.25,
stratify=thyroid_cancer_train_initial['Recurred'],
random_state=987654321)
In [149]:
##################################
# Performing a general exploration
# of the final training dataset
##################################
X_train = thyroid_cancer_train.drop('Recurred', axis = 1)
y_train = thyroid_cancer_train['Recurred']
print('Final Train Dataset Dimensions: ')
display(X_train.shape)
display(y_train.shape)
print('Final Train Target Variable Breakdown: ')
display(y_train.value_counts())
print('Final Train Target Variable Proportion: ')
display(y_train.value_counts(normalize = True))
Final Train Dataset Dimensions:
(204, 16)
(204,)
Final Train Target Variable Breakdown:
Recurred
No     143
Yes     61
Name: count, dtype: int64
Final Train Target Variable Proportion:
Recurred
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
In [150]:
##################################
# Performing a general exploration
# of the validation dataset
##################################
X_validation = thyroid_cancer_validation.drop('Recurred', axis = 1)
y_validation = thyroid_cancer_validation['Recurred']
print('Validation Dataset Dimensions: ')
display(X_validation.shape)
display(y_validation.shape)
print('Validation Target Variable Breakdown: ')
display(y_validation.value_counts())
print('Validation Target Variable Proportion: ')
display(y_validation.value_counts(normalize = True))
Validation Dataset Dimensions:
(69, 16)
(69,)
Validation Target Variable Breakdown:
Recurred
No     49
Yes    20
Name: count, dtype: int64
Validation Target Variable Proportion:
Recurred
No     0.710145
Yes    0.289855
Name: proportion, dtype: float64
In [151]:
##################################
# Saving the training data
# to the DATASETS_FINAL_TRAIN_PATH
# and DATASETS_FINAL_TRAIN_FEATURES_PATH
# and DATASETS_FINAL_TRAIN_TARGET_PATH
##################################
thyroid_cancer_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_PATH, "thyroid_cancer_train.csv"), index=False)
X_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_FEATURES_PATH, "X_train.csv"), index=False)
y_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_TARGET_PATH, "y_train.csv"), index=False)
In [152]:
##################################
# Saving the validation data
# to the DATASETS_FINAL_VALIDATION_PATH
# and DATASETS_FINAL_VALIDATION_FEATURES_PATH
# and DATASETS_FINAL_VALIDATION_TARGET_PATH
##################################
thyroid_cancer_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_PATH, "thyroid_cancer_validation.csv"), index=False)
X_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_FEATURES_PATH, "X_validation.csv"), index=False)
y_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_TARGET_PATH, "y_validation.csv"), index=False)
In [153]:
##################################
# Saving the test data
# to the DATASETS_FINAL_TEST_PATH
# and DATASETS_FINAL_TEST_FEATURES_PATH
# and DATASETS_FINAL_TEST_TARGET_PATH
##################################
thyroid_cancer_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_PATH, "thyroid_cancer_test.csv"), index=False)
X_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_FEATURES_PATH, "X_test.csv"), index=False)
y_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_TARGET_PATH, "y_test.csv"), index=False)
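One caveat when reloading these files: CSV stores no dtype information, so columns held as `category` come back as plain strings unless the dtype is requested on read. The sketch below demonstrates this on a hypothetical two-row frame written to a temporary directory; the file name and frame are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a categorical target column
toy = pd.DataFrame({"Age": [21, 28], "Recurred": pd.Categorical(["No", "Yes"])})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(path, index=False)

    # Default read: the categorical dtype is not recovered
    reloaded = pd.read_csv(path)

    # Requesting the dtype explicitly restores it
    restored = pd.read_csv(path, dtype={"Recurred": "category"})

print(reloaded["Recurred"].dtype, restored["Recurred"].dtype)
```

Passing a `dtype` mapping to `pd.read_csv` for every categorical column is one way to keep downstream code (for example `select_dtypes(include='category')`) behaving as it does on the in-memory frames.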