Model Deployment: Estimating Lung Cancer Probabilities From Demographic Factors, Clinical Symptoms And Behavioral Indicators


John Pauline Pineda

August 31, 2024


  • 1. Table of Contents
    • 1.1 Data Background
    • 1.2 Data Description
    • 1.3 Data Quality Assessment
    • 1.4 Data Preprocessing
    • 1.5 Data Exploration
      • 1.5.1 Exploratory Data Analysis
      • 1.5.2 Hypothesis Testing
    • 1.6 Predictive Model Development
      • 1.6.1 Pre-Modelling Data Preparation
      • 1.6.2 Data Splitting
      • 1.6.3 Modelling Pipeline Development
        • 1.6.3.1 Individual Classifier
        • 1.6.3.2 Stacked Classifier
      • 1.6.4 Model Fitting using Original Training Data | Hyperparameter Tuning | Validation
        • 1.6.4.1 Individual Classifier
        • 1.6.4.2 Stacked Classifier
      • 1.6.5 Model Fitting using Upsampled Training Data | Hyperparameter Tuning | Validation
        • 1.6.5.1 Individual Classifier
        • 1.6.5.2 Stacked Classifier
      • 1.6.6 Model Fitting using Downsampled Training Data | Hyperparameter Tuning | Validation
        • 1.6.6.1 Individual Classifier
        • 1.6.6.2 Stacked Classifier
      • 1.6.7 Model Selection
      • 1.6.8 Model Testing
      • 1.6.9 Model Inference
    • 1.7 Predictive Model Deployment Using Streamlit and Streamlit Community Cloud
      • 1.7.1 Model Prediction Application Code Development
      • 1.7.2 User Interface Application Code Development
      • 1.7.3 Web Application
  • 2. Summary
  • 3. References

1. Table of Contents

This project implements the Logistic Regression Model as an independent learner and as a meta-learner of a stacking ensemble model with Decision Trees, Random Forest, and Support Vector Machine classifier algorithms using various helpful packages in Python to estimate the probability of a dichotomous categorical response variable by modelling the relationship between one or more predictor variables and a binary outcome. The predictions from the candidate models were evaluated using the F1 Score, which accounts for both false positives and false negatives and so gives a more balanced view of classification performance than accuracy alone. Resampling approaches for imbalanced classification problems, including Synthetic Minority Oversampling Technique (upsampling) and Condensed Nearest Neighbors (downsampling), were applied to the training data to achieve a more reasonably balanced distribution between the majority and minority classes. Class Weights were also implemented by amplifying the loss contributed by the minority class and diminishing the loss from the majority class, forcing the model to focus more on correctly predicting the minority class. Penalties including Least Absolute Shrinkage and Selection Operator and Ridge Regularization were evaluated to impose constraints on the model coefficient updates. The final model was deployed as a prototype application with a web interface via Streamlit. All results were consolidated in a Summary presented at the end of the document.

Machine Learning Classification Models are algorithms that learn to assign predefined categories or labels to input data based on patterns and relationships identified during the training phase. Classification is a supervised learning task, meaning the models are trained on a labeled dataset where the correct output (class or label) is known for each input. Once trained, these models can predict the class of new, unseen instances.

Binary Classification Learning refers to a predictive modelling problem where only two class labels are predicted for a given sample of input data. These models use the training data set and calculate how to best map instances of input data to the specific class labels. Typically, binary classification tasks involve one class that is the normal state (assigned the class label 0) and another class that is the abnormal state (assigned the class label 1). It is common to structure a binary classification task with a model that predicts a Bernoulli probability distribution for each instance. The Bernoulli distribution is a discrete probability distribution that covers a case where an event will have a binary outcome as either a 0 or 1. For a binary classification, this means that the model predicts a probability of an instance belonging to class 1, or the abnormal state.
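As a minimal sketch of the Bernoulli framing above, the snippet below maps a linear score to a probability of class 1 with the logistic function and then applies a 0.5 threshold. The intercept, coefficients, and input values are hypothetical and chosen only for illustration; they are not estimates from this project.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept, coefficients and input (illustration only)
intercept = -1.0
coefficients = np.array([0.8, 1.5])
x = np.array([1.0, 0.0])

# Linear score, then the Bernoulli parameter p = P(y = 1 | x)
score = intercept + coefficients @ x
p_class_1 = sigmoid(score)

# A 0.5 threshold converts the probability into a class label
predicted_label = int(p_class_1 >= 0.5)
print(round(float(p_class_1), 4), predicted_label)
```

The threshold of 0.5 is a convention rather than a requirement; a different cut-off may be preferable when the two kinds of error carry different costs.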

Imbalanced Class Learning refers to the process of building and training models to predict a dichotomous categorical response in scenarios where the two classes are not equally represented in the dataset. This imbalance can cause challenges in training machine learning models, leading to biased predictions that favor the majority class or misleading estimation of model performance using the accuracy metric. Several strategies can be employed to effectively handle class imbalance including resampling, class weighting, cost-sensitive learning, and choosing appropriate metrics. In effect, models can be trained to perform well on both the minority and majority classes, ensuring more reliable and fair predictions.
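To illustrate why accuracy can mislead under imbalance, the sketch below scores a hypothetical baseline that always predicts the majority class, using the class counts reported for LUNG_CANCER in this dataset (270 YES, 39 NO). The helper `precision_recall_f1` is written here for illustration and is not part of the project code.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Class counts as reported for LUNG_CANCER: 270 YES and 39 NO
y_true = ["YES"] * 270 + ["NO"] * 39
# Hypothetical baseline that always predicts the majority class
y_pred = ["YES"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
_, _, f1_yes = precision_recall_f1(y_true, y_pred, positive="YES")
_, _, f1_no = precision_recall_f1(y_true, y_pred, positive="NO")
print(f"accuracy={accuracy:.3f}, F1(YES)={f1_yes:.3f}, F1(NO)={f1_no:.3f}")
```

The baseline reaches an accuracy of about 0.874 without ever identifying a NO case, which the F1 score of 0 for the NO class makes visible.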

Regularization Methods, in the context of binary classification using Logistic Regression, are primarily used to prevent overfitting and improve the model's generalization to new data. Overfitting occurs when a model is too complex and learns not only the underlying pattern in the data but also the noise, which leads to poor performance on unseen data. Regularization controls model complexity by adding a penalty term to the loss function that grows with the size of the coefficients. This forces the model to keep the coefficients small, thereby reducing the likelihood of overfitting. Additionally, because the penalized model is less likely to fit noise in the training data, it is more likely to capture the true underlying pattern and generalize better to unseen data.
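The penalty term can be sketched in a few lines of NumPy. The function below is an illustrative assumption rather than the project's implementation: it adds either an L1 (LASSO) or an L2 (Ridge) penalty, scaled by a hypothetical strength `lam`, to the average log loss. Note that scikit-learn's `LogisticRegression` expresses the strength through `C`, the inverse of the regularization strength.

```python
import numpy as np

def penalized_log_loss(w, b, X, y, lam, penalty="l2"):
    """Average log loss plus an L1 or L2 penalty on the coefficients."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    eps = 1e-12  # guards against log(0)
    log_loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    if penalty == "l1":
        reg = lam * np.sum(np.abs(w))   # LASSO: sum of absolute values
    elif penalty == "l2":
        reg = lam * np.sum(w ** 2)      # Ridge: sum of squares
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return log_loss + reg

# Toy data and coefficients (hypothetical, for illustration only)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([3.0, -4.0])

loss_l1 = penalized_log_loss(w, 0.0, X, y, lam=0.1, penalty="l1")
loss_l2 = penalized_log_loss(w, 0.0, X, y, lam=0.1, penalty="l2")
print(loss_l1, loss_l2)
```

For the same coefficients, the L2 penalty punishes large values more heavily (25 versus 7 before scaling), while the L1 penalty is the one that tends to drive some coefficients exactly to zero during fitting.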

Streamlit is an open-source Python library that simplifies the creation and deployment of web applications for machine learning and data science projects. It allows developers and data scientists to turn Python scripts into interactive web apps quickly without requiring extensive web development knowledge. Streamlit seamlessly integrates with popular Python libraries such as Pandas, Matplotlib, Plotly, and TensorFlow, allowing one to leverage existing data processing and visualization tools within the application. Streamlit apps can be easily deployed on various platforms, including Streamlit Community Cloud, Heroku, or any cloud service that supports Python web applications.

Streamlit Community Cloud, formerly known as Streamlit Sharing, is a free cloud-based platform provided by Streamlit that allows users to easily deploy and share Streamlit apps online. It is particularly popular among data scientists, machine learning engineers, and developers for quickly showcasing projects, creating interactive demos, and sharing data-driven applications with a wider audience without needing to manage server infrastructure. Significant features include:

  • Free hosting - Streamlit Community Cloud provides free hosting for Streamlit apps, making it accessible for users who want to share their work without incurring hosting costs
  • Easy deployment - Users can connect their GitHub repository to Streamlit Community Cloud, and the app is automatically deployed from the repository
  • Continuous deployment - If the code in the connected GitHub repository is updated, the app is automatically redeployed with the latest changes
  • Sharing capabilities - Once deployed, apps can be shared with others via a simple URL, making it easy for collaborators, stakeholders, or the general public to access and interact with the app
  • Built-in authentication - Users can restrict access to their apps using GitHub-based authentication, allowing control over who can view and interact with the app
  • Community support - The platform is supported by a community of users and developers who share knowledge, templates, and best practices for building and deploying Streamlit apps

1.1. Data Background

An open Lung Cancer Dataset from Kaggle (with all credits attributed to Nancy Al Aswad) was used for the analysis as consolidated from the following primary source:

  1. Research Paper entitled Optimal Discriminant Plane for a Small Number of Samples and Design Method of Classifier on the Plane from the Pattern Recognition Journal

This study hypothesized that demographic factors, clinical symptoms, and behavioral indicators influence lung cancer probabilities among patients.

The dichotomous categorical variable for the study is:

  • LUNG_CANCER - Lung cancer status of the patient (YES, lung cancer case | NO, non-lung cancer case)

The predictor variables for the study are:

  • GENDER - Patient's sex (M, Male | F, Female)
  • AGE - Patient's age (Years)
  • SMOKING - Behavioral indication of smoking (1, Absent | 2, Present)
  • YELLOW_FINGERS - Clinical symptom of yellowing of fingers (1, Absent | 2, Present)
  • ANXIETY - Behavioral indication of experiencing anxiety (1, Absent | 2, Present)
  • PEER_PRESSURE - Behavioral indication of experiencing peer pressure (1, Absent | 2, Present)
  • CHRONIC_DISEASE - Clinical symptom of chronic diseases (1, Absent | 2, Present)
  • FATIGUE - Clinical symptom of chronic fatigue (1, Absent | 2, Present)
  • ALLERGY - Clinical symptom of allergies (1, Absent | 2, Present)
  • WHEEZING - Clinical symptom of wheezing (1, Absent | 2, Present)
  • ALCOHOL_CONSUMING - Behavioral indication of consuming alcohol (1, Absent | 2, Present)
  • COUGHING - Clinical symptom of coughing (1, Absent | 2, Present)
  • SHORTNESS_OF_BREATH - Clinical symptom of shortness of breath (1, Absent | 2, Present)
  • SWALLOWING_DIFFICULTY - Clinical symptom of difficulty in swallowing (1, Absent | 2, Present)
  • CHEST_PAIN - Clinical symptom of chest pain (1, Absent | 2, Present)

1.2. Data Description

  1. The dataset is comprised of:
    • 309 rows (observations)
    • 16 columns (variables)
      • 1/16 target (categorical)
        • LUNG_CANCER
      • 1/16 predictor (numeric)
        • AGE
      • 14/16 predictors (categorical)
        • GENDER
        • SMOKING
        • YELLOW_FINGERS
        • ANXIETY
        • PEER_PRESSURE
        • CHRONIC_DISEASE
        • FATIGUE
        • ALLERGY
        • WHEEZING
        • ALCOHOL_CONSUMING
        • COUGHING
        • SHORTNESS_OF_BREATH
        • SWALLOWING_DIFFICULTY
        • CHEST_PAIN
In [1]:
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
import itertools
import joblib
%matplotlib inline

from operator import add,mul,truediv
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PowerTransformer, StandardScaler
from scipy import stats
from scipy.stats import pointbiserialr

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import CondensedNearestNeighbour
In [2]:
##################################
# Defining file paths
# (built with os.path.join so they resolve on any operating system)
##################################
DATASETS_ORIGINAL_PATH = os.path.join("datasets", "original")
DATASETS_PREPROCESSED_PATH = os.path.join("datasets", "preprocessed")
DATASETS_FINAL_PATH = os.path.join("datasets", "final", "complete")
DATASETS_FINAL_TRAIN_PATH = os.path.join("datasets", "final", "train")
DATASETS_FINAL_TRAIN_FEATURES_PATH = os.path.join("datasets", "final", "train", "features")
DATASETS_FINAL_TRAIN_TARGET_PATH = os.path.join("datasets", "final", "train", "target")
DATASETS_FINAL_VALIDATION_PATH = os.path.join("datasets", "final", "validation")
DATASETS_FINAL_VALIDATION_FEATURES_PATH = os.path.join("datasets", "final", "validation", "features")
DATASETS_FINAL_VALIDATION_TARGET_PATH = os.path.join("datasets", "final", "validation", "target")
DATASETS_FINAL_TEST_PATH = os.path.join("datasets", "final", "test")
DATASETS_FINAL_TEST_FEATURES_PATH = os.path.join("datasets", "final", "test", "features")
DATASETS_FINAL_TEST_TARGET_PATH = os.path.join("datasets", "final", "test", "target")
MODELS_PATH = "models"
In [3]:
##################################
# Loading the dataset
# from the DATASETS_ORIGINAL_PATH
##################################
lung_cancer = pd.read_csv(os.path.join("..", DATASETS_ORIGINAL_PATH, "lung_cancer.csv"))
In [4]:
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(lung_cancer.shape)
Dataset Dimensions: 
(309, 16)
In [5]:
##################################
# Verifying the column names
##################################
print('Column Names: ')
display(lung_cancer.columns)
Column Names: 
Index(['GENDER', 'AGE', 'SMOKING', 'YELLOW_FINGERS', 'ANXIETY',
       'PEER_PRESSURE', 'CHRONIC DISEASE', 'FATIGUE ', 'ALLERGY ', 'WHEEZING',
       'ALCOHOL CONSUMING', 'COUGHING', 'SHORTNESS OF BREATH',
       'SWALLOWING DIFFICULTY', 'CHEST PAIN', 'LUNG_CANCER'],
      dtype='object')
In [6]:
##################################
# Removing trailing white spaces
# in column names
##################################
lung_cancer.columns = [x.strip() for x in lung_cancer.columns]
In [7]:
##################################
# Standardizing the column names
##################################
lung_cancer.columns = ['GENDER', 
                       'AGE', 
                       'SMOKING', 
                       'YELLOW_FINGERS', 
                       'ANXIETY',
                       'PEER_PRESSURE', 
                       'CHRONIC_DISEASE', 
                       'FATIGUE', 
                       'ALLERGY', 
                       'WHEEZING',
                       'ALCOHOL_CONSUMING', 
                       'COUGHING', 
                       'SHORTNESS_OF_BREATH',
                       'SWALLOWING_DIFFICULTY', 
                       'CHEST_PAIN', 
                       'LUNG_CANCER']
In [8]:
##################################
# Verifying the corrected column names
##################################
print('Column Names: ')
display(lung_cancer.columns)
Column Names: 
Index(['GENDER', 'AGE', 'SMOKING', 'YELLOW_FINGERS', 'ANXIETY',
       'PEER_PRESSURE', 'CHRONIC_DISEASE', 'FATIGUE', 'ALLERGY', 'WHEEZING',
       'ALCOHOL_CONSUMING', 'COUGHING', 'SHORTNESS_OF_BREATH',
       'SWALLOWING_DIFFICULTY', 'CHEST_PAIN', 'LUNG_CANCER'],
      dtype='object')
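The two renaming steps above (stripping whitespace, then assigning the full list by hand) could also be done in one pass. The sketch below is an alternative, not the approach used in this notebook: it rebuilds the raw column names from the earlier output on an empty, hypothetical `demo` frame and normalizes them with pandas string methods.

```python
import pandas as pd

# Raw column names as printed before cleaning (note the trailing and inner spaces)
raw_columns = ['GENDER', 'AGE', 'SMOKING', 'YELLOW_FINGERS', 'ANXIETY',
               'PEER_PRESSURE', 'CHRONIC DISEASE', 'FATIGUE ', 'ALLERGY ', 'WHEEZING',
               'ALCOHOL CONSUMING', 'COUGHING', 'SHORTNESS OF BREATH',
               'SWALLOWING DIFFICULTY', 'CHEST PAIN', 'LUNG_CANCER']

# Empty stand-in frame, used only to demonstrate the renaming
demo = pd.DataFrame(columns=raw_columns)

# Strip surrounding whitespace, then replace inner whitespace runs with underscores
demo.columns = demo.columns.str.strip().str.replace(r"\s+", "_", regex=True)
print(list(demo.columns))
```

Deriving the names from the originals avoids the risk of a hand-typed list drifting out of order relative to the data.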
In [9]:
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(lung_cancer.dtypes)
Column Names and Data Types:
GENDER                   object
AGE                       int64
SMOKING                   int64
YELLOW_FINGERS            int64
ANXIETY                   int64
PEER_PRESSURE             int64
CHRONIC_DISEASE           int64
FATIGUE                   int64
ALLERGY                   int64
WHEEZING                  int64
ALCOHOL_CONSUMING         int64
COUGHING                  int64
SHORTNESS_OF_BREATH       int64
SWALLOWING_DIFFICULTY     int64
CHEST_PAIN                int64
LUNG_CANCER              object
dtype: object
In [10]:
##################################
# Taking a snapshot of the dataset
##################################
lung_cancer.head()
Out[10]:
GENDER AGE SMOKING YELLOW_FINGERS ANXIETY PEER_PRESSURE CHRONIC_DISEASE FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SHORTNESS_OF_BREATH SWALLOWING_DIFFICULTY CHEST_PAIN LUNG_CANCER
0 M 69 1 2 2 1 1 2 1 2 2 2 2 2 2 YES
1 M 74 2 1 1 1 2 2 2 1 1 1 2 2 2 YES
2 F 59 1 1 1 2 1 2 1 2 1 2 2 1 2 NO
3 M 63 2 2 2 1 1 1 1 1 2 1 1 2 2 NO
4 F 63 1 2 1 1 1 1 1 2 1 2 2 1 1 NO
In [11]:
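##################################
# Setting the levels of the dichotomous categorical variables
# and relabelling the integer codes (1, 2) as Absent and Present
# Note: LUNG_CANCER is also listed in int_columns,
# so its dtype reverts from category to object
##################################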
lung_cancer[['GENDER','LUNG_CANCER']] = lung_cancer[['GENDER','LUNG_CANCER']].astype('category')
lung_cancer['GENDER'] = lung_cancer['GENDER'].cat.set_categories(['F', 'M'], ordered=True)
lung_cancer['LUNG_CANCER'] = lung_cancer['LUNG_CANCER'].cat.set_categories(['NO', 'YES'], ordered=True)
int_columns = ['SMOKING',
               'YELLOW_FINGERS', 
               'ANXIETY',
               'PEER_PRESSURE', 
               'CHRONIC_DISEASE', 
               'FATIGUE', 
               'ALLERGY', 
               'WHEEZING',
               'ALCOHOL_CONSUMING', 
               'COUGHING', 
               'SHORTNESS_OF_BREATH',
               'SWALLOWING_DIFFICULTY', 
               'CHEST_PAIN', 
               'LUNG_CANCER']
lung_cancer[int_columns] = lung_cancer[int_columns].astype(object)
lung_cancer[int_columns] = lung_cancer[int_columns].replace({1: 'Absent', 2: 'Present'})
In [12]:
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(lung_cancer.dtypes)
Column Names and Data Types:
GENDER                   category
AGE                         int64
SMOKING                    object
YELLOW_FINGERS             object
ANXIETY                    object
PEER_PRESSURE              object
CHRONIC_DISEASE            object
FATIGUE                    object
ALLERGY                    object
WHEEZING                   object
ALCOHOL_CONSUMING          object
COUGHING                   object
SHORTNESS_OF_BREATH        object
SWALLOWING_DIFFICULTY      object
CHEST_PAIN                 object
LUNG_CANCER                object
dtype: object
In [13]:
##################################
# Taking a snapshot of the dataset
##################################
lung_cancer.head()
Out[13]:
GENDER AGE SMOKING YELLOW_FINGERS ANXIETY PEER_PRESSURE CHRONIC_DISEASE FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SHORTNESS_OF_BREATH SWALLOWING_DIFFICULTY CHEST_PAIN LUNG_CANCER
0 M 69 Absent Present Present Absent Absent Present Absent Present Present Present Present Present Present YES
1 M 74 Present Absent Absent Absent Present Present Present Absent Absent Absent Present Present Present YES
2 F 59 Absent Absent Absent Present Absent Present Absent Present Absent Present Present Absent Present NO
3 M 63 Present Present Present Absent Absent Absent Absent Absent Present Absent Absent Present Present NO
4 F 63 Absent Present Absent Absent Absent Absent Absent Present Absent Present Present Absent Absent NO
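As a hedged alternative to the `astype(object)` and `replace` steps above, the integer codes can be mapped to labels and stored directly as a categorical dtype. The `demo` frame and its values below are hypothetical; only the code-to-label mapping (1 for Absent, 2 for Present) comes from the data dictionary.

```python
import pandas as pd

# Small hypothetical frame using the same 1/2 coding as the dataset
demo = pd.DataFrame({"SMOKING": [1, 2, 2, 1], "COUGHING": [2, 2, 1, 1]})
code_labels = {1: "Absent", 2: "Present"}

# Map the codes to labels and keep a fixed category order
for column in ["SMOKING", "COUGHING"]:
    demo[column] = pd.Categorical(demo[column].map(code_labels),
                                  categories=["Absent", "Present"])

# Unlike replace, map turns an unexpected code into a missing value
unexpected = pd.Series([1, 3]).map(code_labels)

print(demo.dtypes)
print(unexpected.isna().tolist())
```

Surfacing unexpected codes as missing values makes data-entry problems show up in the data quality assessment instead of passing through silently.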
In [14]:
##################################
# Performing a general exploration 
# of the numeric variables
##################################
print('Numeric Variable Summary:')
display(lung_cancer.describe(include='number').transpose())
Numeric Variable Summary:
count mean std min 25% 50% 75% max
AGE 309.0 62.673139 8.210301 21.0 57.0 62.0 69.0 87.0
In [15]:
##################################
# Performing a general exploration 
# of the object and categorical variables
##################################
print('Categorical Variable Summary:')
display(lung_cancer.describe(include=['category','object']).transpose())
Categorical Variable Summary:
count unique top freq
GENDER 309 2 M 162
SMOKING 309 2 Present 174
YELLOW_FINGERS 309 2 Present 176
ANXIETY 309 2 Absent 155
PEER_PRESSURE 309 2 Present 155
CHRONIC_DISEASE 309 2 Present 156
FATIGUE 309 2 Present 208
ALLERGY 309 2 Present 172
WHEEZING 309 2 Present 172
ALCOHOL_CONSUMING 309 2 Present 172
COUGHING 309 2 Present 179
SHORTNESS_OF_BREATH 309 2 Present 198
SWALLOWING_DIFFICULTY 309 2 Absent 164
CHEST_PAIN 309 2 Present 172
LUNG_CANCER 309 2 YES 270

1.3. Data Quality Assessment

Data quality findings based on assessment are as follows:

  1. 33 duplicated rows observed. These cases were not removed considering that most variables are dichotomous categorical where duplicate values might be possible.
  2. No missing data noted for any variable (no variable with Null.Count>0 or Fill.Rate<1.0).
  3. No low variance observed for the numeric predictor with First.Second.Mode.Ratio>5.
  4. No low variance observed for the numeric and categorical predictors with Unique.Count.Ratio>5.
  5. Low variance observed for the target variable with First.Second.Mode.Ratio>5 indicating class imbalance that needs to be addressed for the downstream modelling process.
    • LUNG_CANCER: First.Second.Mode.Ratio = +6.923
  6. No high skewness observed for the numeric predictor with Skewness>3 or Skewness<(-3).
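The 6.923 figure reported for LUNG_CANCER can be reproduced from the class counts in the categorical summary (270 YES out of 309 rows) as the count of the most frequent level divided by the count of the second most frequent level. The series below is a reconstruction from those counts, not the loaded dataset.

```python
import pandas as pd

# Target reconstructed from the reported class counts: 270 YES, 39 NO
target = pd.Series(["YES"] * 270 + ["NO"] * 39, name="LUNG_CANCER")

# value_counts sorts levels from most to least frequent
counts = target.value_counts()
imbalance_ratio = counts.iloc[0] / counts.iloc[1]
print(round(imbalance_ratio, 3))
```

A ratio this far above 1 is what motivates the resampling and class-weighting steps later in the modelling process.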
In [16]:
##################################
# Counting the number of duplicated rows
##################################
lung_cancer.duplicated().sum()
Out[16]:
np.int64(33)
In [17]:
##################################
# Displaying the duplicated rows
##################################
lung_cancer[lung_cancer.duplicated()]
Out[17]:
GENDER AGE SMOKING YELLOW_FINGERS ANXIETY PEER_PRESSURE CHRONIC_DISEASE FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SHORTNESS_OF_BREATH SWALLOWING_DIFFICULTY CHEST_PAIN LUNG_CANCER
99 M 56 Present Absent Absent Absent Absent Present Present Present Present Present Present Absent Present YES
100 M 58 Present Absent Absent Absent Absent Absent Present Present Present Present Absent Absent Absent YES
117 F 51 Present Present Present Present Absent Present Present Absent Absent Absent Present Present Absent YES
199 F 55 Present Absent Absent Present Present Present Present Present Present Absent Absent Present Present YES
212 M 58 Present Absent Absent Absent Absent Present Present Present Present Present Present Absent Present YES
223 M 63 Present Present Present Absent Present Present Present Present Absent Absent Present Absent Absent YES
256 M 60 Present Absent Absent Absent Absent Present Present Present Present Present Present Absent Present YES
275 M 64 Present Present Present Present Present Absent Absent Absent Present Absent Absent Present Present YES
284 M 58 Present Present Present Present Present Absent Absent Absent Present Absent Absent Present Present YES
285 F 58 Present Present Present Present Absent Present Absent Absent Absent Present Present Present Absent YES
286 F 63 Absent Absent Absent Absent Present Present Absent Absent Absent Absent Present Absent Absent NO
287 F 51 Present Present Present Present Absent Present Absent Absent Absent Absent Present Present Absent YES
288 F 61 Absent Present Present Present Absent Absent Present Present Absent Present Absent Present Absent YES
289 F 61 Present Absent Absent Absent Present Present Present Absent Absent Absent Present Absent Absent YES
290 M 76 Present Absent Absent Absent Absent Present Present Present Present Present Present Absent Present YES
291 M 71 Present Present Present Absent Present Absent Present Present Present Present Absent Present Present YES
292 M 69 Absent Absent Present Absent Absent Present Absent Present Present Present Present Present Absent YES
293 F 56 Present Present Present Absent Absent Present Present Absent Absent Absent Present Absent Present YES
294 M 67 Absent Absent Absent Present Absent Present Absent Present Absent Present Present Absent Present YES
295 F 54 Present Present Present Absent Present Absent Absent Present Present Absent Present Present Present YES
296 M 63 Absent Present Absent Absent Absent Present Absent Present Present Present Present Absent Absent YES
297 F 47 Present Present Absent Present Present Present Present Present Absent Present Present Absent Absent YES
298 M 62 Present Absent Present Absent Absent Present Absent Present Present Present Present Absent Present YES
299 M 65 Present Present Present Present Absent Present Present Absent Absent Absent Present Present Absent YES
300 F 63 Present Present Present Present Present Present Present Present Absent Present Present Present Present YES
301 M 64 Absent Present Present Present Absent Absent Present Absent Present Absent Absent Present Present YES
302 F 65 Present Present Present Present Absent Present Absent Present Absent Present Present Present Absent YES
303 M 51 Absent Present Absent Absent Present Present Present Present Present Present Present Absent Present YES
304 F 56 Absent Absent Absent Present Present Present Absent Absent Present Present Present Present Absent YES
305 M 70 Present Absent Absent Absent Absent Present Present Present Present Present Present Absent Present YES
306 M 58 Present Absent Absent Absent Absent Absent Present Present Present Present Absent Absent Present YES
307 M 67 Present Absent Present Absent Absent Present Present Absent Present Present Present Absent Present YES
308 M 62 Absent Absent Absent Present Absent Present Present Present Present Absent Absent Present Absent YES
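When reading the duplicate count, it helps to remember that `DataFrame.duplicated()` marks only the second and later occurrences by default, so the first copy of each repeated row is not counted. The sketch below shows the difference on a small hypothetical frame.

```python
import pandas as pd

# Hypothetical frame in which one row appears three times
demo = pd.DataFrame({"GENDER": ["M", "F", "M", "M"],
                     "AGE": [56, 51, 56, 56],
                     "LUNG_CANCER": ["YES", "YES", "YES", "YES"]})

# Default keep='first': only the repeats after the first occurrence are flagged
repeat_count = int(demo.duplicated().sum())

# keep=False: every row that has at least one identical copy is flagged
all_copies_count = int(demo.duplicated(keep=False).sum())

# Number of distinct rows that would remain after deduplication
distinct_count = len(demo.drop_duplicates())
print(repeat_count, all_copies_count, distinct_count)
```

With mostly dichotomous predictors, identical rows can belong to different patients, which is consistent with the decision above to keep them.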
In [18]:
##################################
# Gathering the data types for each column
##################################
data_type_list = list(lung_cancer.dtypes)
In [19]:
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(lung_cancer.columns)
In [20]:
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(lung_cancer)] * len(lung_cancer.columns))
In [21]:
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(lung_cancer.isna().sum(axis=0))
In [22]:
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(lung_cancer.count())
In [23]:
##################################
# Gathering the fill rate (proportion of non-missing data) for each column
##################################
fill_rate_list = list(map(truediv, non_null_count_list, row_count_list))
In [24]:
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
                                              data_type_list,
                                              row_count_list,
                                              non_null_count_list,
                                              null_count_list,
                                              fill_rate_list), 
                                        columns=['Column.Name',
                                                 'Column.Type',
                                                 'Row.Count',
                                                 'Non.Null.Count',
                                                 'Null.Count',                                                 
                                                 'Fill.Rate'])
display(all_column_quality_summary)
Column.Name Column.Type Row.Count Non.Null.Count Null.Count Fill.Rate
0 GENDER category 309 309 0 1.0
1 AGE int64 309 309 0 1.0
2 SMOKING object 309 309 0 1.0
3 YELLOW_FINGERS object 309 309 0 1.0
4 ANXIETY object 309 309 0 1.0
5 PEER_PRESSURE object 309 309 0 1.0
6 CHRONIC_DISEASE object 309 309 0 1.0
7 FATIGUE object 309 309 0 1.0
8 ALLERGY object 309 309 0 1.0
9 WHEEZING object 309 309 0 1.0
10 ALCOHOL_CONSUMING object 309 309 0 1.0
11 COUGHING object 309 309 0 1.0
12 SHORTNESS_OF_BREATH object 309 309 0 1.0
13 SWALLOWING_DIFFICULTY object 309 309 0 1.0
14 CHEST_PAIN object 309 309 0 1.0
15 LUNG_CANCER object 309 309 0 1.0
In [25]:
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
print('Number of Columns with Missing Data:', str(len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])))
Number of Columns with Missing Data: 0
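The column-level summary above is assembled from several intermediate lists. As a compact sketch of the same idea, the table can be built directly from pandas aggregations; the `demo` frame and its missing values are hypothetical, added so that the fill rate is not trivially 1.0.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
demo = pd.DataFrame({"AGE": [69, 74, np.nan, 63],
                     "SMOKING": ["Absent", "Present", "Absent", None]})

# Each aggregation returns a Series indexed by column name, so they align automatically
summary = pd.DataFrame({
    "Column.Type": demo.dtypes,
    "Row.Count": len(demo),
    "Non.Null.Count": demo.count(),
    "Null.Count": demo.isna().sum(),
    "Fill.Rate": demo.count() / len(demo),
}).rename_axis("Column.Name").reset_index()

print(summary)
print("Number of Columns with Missing Data:", int((summary["Fill.Rate"] < 1).sum()))
```

Because the pieces share an index, there is no need to keep parallel lists in the same order by hand.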
In [26]:
##################################
# Identifying the columns
# with Fill.Rate < 1.00
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1.00)]
In [27]:
##################################
# Gathering the metadata labels for each observation
##################################
row_metadata_list = lung_cancer.index.values.tolist()
In [28]:
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(lung_cancer.columns)] * len(lung_cancer))
In [29]:
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(lung_cancer.isna().sum(axis=1))
In [30]:
##################################
# Gathering the missing data rate for each row
##################################
missing_rate_list = list(map(truediv, null_row_list, column_count_list))
In [31]:
##################################
# Exploring the rows
# for missing data
##################################
all_row_quality_summary = pd.DataFrame(zip(row_metadata_list,
                                           column_count_list,
                                           null_row_list,
                                           missing_rate_list), 
                                        columns=['Row.Name',
                                                 'Column.Count',
                                                 'Null.Count',                                                 
                                                 'Missing.Rate'])
display(all_row_quality_summary)
Row.Name Column.Count Null.Count Missing.Rate
0 0 16 0 0.0
1 1 16 0 0.0
2 2 16 0 0.0
3 3 16 0 0.0
4 4 16 0 0.0
... ... ... ... ...
304 304 16 0 0.0
305 305 16 0 0.0
306 306 16 0 0.0
307 307 16 0 0.0
308 308 16 0 0.0

309 rows × 4 columns

In [32]:
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
print('Number of Rows with Missing Data:',str(len(all_row_quality_summary[all_row_quality_summary['Missing.Rate']>0])))
Number of Rows with Missing Data: 0
In [33]:
##################################
# Formulating the dataset
# with numeric columns only
##################################
lung_cancer_numeric = lung_cancer.select_dtypes(include=['number','int'])
In [34]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = lung_cancer_numeric.columns
In [35]:
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = lung_cancer_numeric.min()
In [36]:
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = lung_cancer_numeric.mean()
In [37]:
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = lung_cancer_numeric.median()
In [38]:
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = lung_cancer_numeric.max()
In [39]:
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [lung_cancer[x].value_counts(dropna=True).index.tolist()[0] for x in lung_cancer_numeric]
In [40]:
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [lung_cancer[x].value_counts(dropna=True).index.tolist()[1] for x in lung_cancer_numeric]
In [41]:
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [lung_cancer_numeric[x].isin([lung_cancer[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in lung_cancer_numeric]
In [42]:
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [lung_cancer_numeric[x].isin([lung_cancer[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in lung_cancer_numeric]
In [43]:
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
In [44]:
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = lung_cancer_numeric.nunique(dropna=True)
In [45]:
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(lung_cancer_numeric)] * len(lung_cancer_numeric.columns))
In [46]:
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
In [47]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = lung_cancer_numeric.skew()
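For reference, pandas' `skew()` reports the bias-adjusted (adjusted Fisher-Pearson) sample skewness. A minimal sketch in plain Python, using hypothetical toy values rather than the AGE column, shows the arithmetic behind the figure that is later compared against the ±3 threshold. The function and variable names are illustrative only.

```python
import math

def adjusted_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (the estimator pandas' skew() reports)."""
    n = len(values)
    mean = sum(values) / n
    # Second and third central moments (population form)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    g1 = m3 / m2 ** 1.5
    # Small-sample bias adjustment
    return math.sqrt(n * (n - 1)) / (n - 2) * g1

# Hypothetical values with a long right tail
toy_values = [1, 2, 3, 4, 10]
toy_skewness = adjusted_skewness(toy_values)
print(round(toy_skewness, 4))
```

A positive result indicates a longer right tail; the AGE skewness of -0.395 reported in the summary therefore points to a mild left tail.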
In [48]:
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = lung_cancer_numeric.kurtosis()
In [49]:
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
                                                numeric_minimum_list,
                                                numeric_mean_list,
                                                numeric_median_list,
                                                numeric_maximum_list,
                                                numeric_first_mode_list,
                                                numeric_second_mode_list,
                                                numeric_first_mode_count_list,
                                                numeric_second_mode_count_list,
                                                numeric_first_second_mode_ratio_list,
                                                numeric_unique_count_list,
                                                numeric_row_count_list,
                                                numeric_unique_count_ratio_list,
                                                numeric_skewness_list,
                                                numeric_kurtosis_list), 
                                        columns=['Numeric.Column.Name',
                                                 'Minimum',
                                                 'Mean',
                                                 'Median',
                                                 'Maximum',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio',
                                                 'Skewness',
                                                 'Kurtosis'])
display(numeric_column_quality_summary)
Numeric.Column.Name Minimum Mean Median Maximum First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio Skewness Kurtosis
0 AGE 21 62.673139 62.0 87 64 56 20 19 1.052632 39 309 0.126214 -0.395086 1.746558
In [50]:
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[50]:
0
In [51]:
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
Out[51]:
0
In [52]:
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])
Out[52]:
0
In [53]:
##################################
# Formulating the dataset
# with object or categorical column only
##################################
lung_cancer_object = lung_cancer.select_dtypes(include=['object','category'])
In [54]:
##################################
# Gathering the variable names for the object or categorical column
##################################
categorical_variable_name_list = lung_cancer_object.columns
In [55]:
##################################
# Gathering the first mode values for the object or categorical column
##################################
categorical_first_mode_list = [lung_cancer[x].value_counts().index.tolist()[0] for x in lung_cancer_object]
In [56]:
##################################
# Gathering the second mode values for each object or categorical column
##################################
categorical_second_mode_list = [lung_cancer[x].value_counts().index.tolist()[1] for x in lung_cancer_object]
In [57]:
##################################
# Gathering the count of first mode values for each object or categorical column
##################################
categorical_first_mode_count_list = [lung_cancer_object[x].isin([lung_cancer[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in lung_cancer_object]
In [58]:
##################################
# Gathering the count of second mode values for each object or categorical column
##################################
categorical_second_mode_count_list = [lung_cancer_object[x].isin([lung_cancer[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in lung_cancer_object]
In [59]:
##################################
# Gathering the first mode to second mode ratio for each object or categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
In [60]:
##################################
# Gathering the count of unique values for each object or categorical column
##################################
categorical_unique_count_list = lung_cancer_object.nunique(dropna=True)
In [61]:
##################################
# Gathering the number of observations for each object or categorical column
##################################
categorical_row_count_list = list([len(lung_cancer_object)] * len(lung_cancer_object.columns))
In [62]:
##################################
# Gathering the unique to count ratio for each object or categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
In [63]:
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
                                                 categorical_first_mode_list,
                                                 categorical_second_mode_list,
                                                 categorical_first_mode_count_list,
                                                 categorical_second_mode_count_list,
                                                 categorical_first_second_mode_ratio_list,
                                                 categorical_unique_count_list,
                                                 categorical_row_count_list,
                                                 categorical_unique_count_ratio_list), 
                                        columns=['Categorical.Column.Name',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio'])
display(categorical_column_quality_summary)
Categorical.Column.Name First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio
0 GENDER M F 162 147 1.102041 2 309 0.006472
1 SMOKING Present Absent 174 135 1.288889 2 309 0.006472
2 YELLOW_FINGERS Present Absent 176 133 1.323308 2 309 0.006472
3 ANXIETY Absent Present 155 154 1.006494 2 309 0.006472
4 PEER_PRESSURE Present Absent 155 154 1.006494 2 309 0.006472
5 CHRONIC_DISEASE Present Absent 156 153 1.019608 2 309 0.006472
6 FATIGUE Present Absent 208 101 2.059406 2 309 0.006472
7 ALLERGY Present Absent 172 137 1.255474 2 309 0.006472
8 WHEEZING Present Absent 172 137 1.255474 2 309 0.006472
9 ALCOHOL_CONSUMING Present Absent 172 137 1.255474 2 309 0.006472
10 COUGHING Present Absent 179 130 1.376923 2 309 0.006472
11 SHORTNESS_OF_BREATH Present Absent 198 111 1.783784 2 309 0.006472
12 SWALLOWING_DIFFICULTY Absent Present 164 145 1.131034 2 309 0.006472
13 CHEST_PAIN Present Absent 172 137 1.255474 2 309 0.006472
14 LUNG_CANCER YES NO 270 39 6.923077 2 309 0.006472
In [64]:
##################################
# Counting the number of object or categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[64]:
1
In [65]:
##################################
# Identifying the object or categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
display(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)].sort_values(by=['First.Second.Mode.Ratio'], ascending=False))
Categorical.Column.Name First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio
14 LUNG_CANCER YES NO 270 39 6.923077 2 309 0.006472
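The First.Second.Mode.Ratio flagged above is the count of the most frequent category divided by the count of the second most frequent one. A minimal sketch in plain Python, rebuilding the LUNG_CANCER column from the counts reported in the summary table (270 YES, 39 NO), reproduces the reported ratio and Unique.Count.Ratio. The variable names are illustrative.

```python
from collections import Counter

# Rebuild the target column from the counts shown in the summary table
target_values = ['YES'] * 270 + ['NO'] * 39

# most_common() returns (value, count) pairs sorted by descending count
(first_mode, first_count), (second_mode, second_count) = Counter(target_values).most_common(2)

first_second_mode_ratio = first_count / second_count
unique_count_ratio = len(set(target_values)) / len(target_values)

print(first_mode, second_mode, round(first_second_mode_ratio, 6), round(unique_count_ratio, 6))
```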
In [66]:
##################################
# Counting the number of object or categorical columns
# with Unique.Count.Ratio > 10.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
Out[66]:
0

1.4. Data Preprocessing ¶

  1. No data transformation or scaling was applied to the numeric predictor AGE, given its minimal number of outliers (2 of 309 rows by the interquartile range criterion) and low skewness (-0.395).
  2. All dichotomous categorical predictors except GENDER were encoded as binary indicators (Absent=0, Present=1) for the correlation analysis.
  3. All variables were retained since the majority showed only low to moderate correlation, with no excessive multicollinearity.
    • Minimal correlation was observed between the numeric predictor and the dichotomous categorical predictors using the point-biserial coefficient.
    • Minimal correlation was observed between pairs of dichotomous categorical predictors using the phi coefficient.
  4. Among pairwise combinations of predictors, the highest correlation values observed were moderate, with no excessive multicollinearity noted:
    • ANXIETY and YELLOW_FINGERS: Phi.Coefficient = +0.570
    • ANXIETY and SWALLOWING_DIFFICULTY: Phi.Coefficient = +0.490
    • SHORTNESS_OF_BREATH and FATIGUE: Phi.Coefficient = +0.440
    • COUGHING and WHEEZING: Phi.Coefficient = +0.370
    • SWALLOWING_DIFFICULTY and PEER_PRESSURE: Phi.Coefficient = +0.370
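The phi coefficient quoted above is the Pearson correlation applied to two 0/1 variables, which is why a plain Pearson call is sufficient for pairs of binary predictors. A minimal sketch with hypothetical toy data (not taken from the dataset) checks the 2x2 contingency-table formula against a direct Pearson computation. All names are illustrative.

```python
import math

def phi_from_table(x, y):
    """Phi coefficient from the 2x2 contingency table of two 0/1 sequences."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denominator

def pearson(x, y):
    """Pearson correlation coefficient computed from first principles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical symptom indicators for eight patients
symptom_a = [1, 1, 1, 0, 0, 0, 1, 0]
symptom_b = [1, 1, 0, 0, 0, 1, 1, 0]

phi = phi_from_table(symptom_a, symptom_b)
r = pearson(symptom_a, symptom_b)
print(phi, r)
```

Both values agree, so the choice between the two formulas is a matter of convenience rather than of result.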
In [67]:
##################################
# Formulating the dataset
# with numeric columns only
##################################
lung_cancer_numeric = lung_cancer.select_dtypes(include=['number','int'])
In [68]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = lung_cancer_numeric.columns
In [69]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = lung_cancer_numeric.skew()
In [70]:
##################################
# Computing the interquartile range
# for all columns
##################################
lung_cancer_numeric_q1 = lung_cancer_numeric.quantile(0.25)
lung_cancer_numeric_q3 = lung_cancer_numeric.quantile(0.75)
lung_cancer_numeric_iqr = lung_cancer_numeric_q3 - lung_cancer_numeric_q1
In [71]:
##################################
# Gathering the outlier count for each numeric column
# based on the interquartile range criterion
##################################
numeric_outlier_count_list = ((lung_cancer_numeric < (lung_cancer_numeric_q1 - 1.5 * lung_cancer_numeric_iqr)) | (lung_cancer_numeric > (lung_cancer_numeric_q3 + 1.5 * lung_cancer_numeric_iqr))).sum()
In [72]:
##################################
# Gathering the number of observations for each column
##################################
numeric_row_count_list = list([len(lung_cancer_numeric)] * len(lung_cancer_numeric.columns))
In [73]:
##################################
# Gathering the outlier to row count ratio for each numeric column
##################################
numeric_outlier_ratio_list = map(truediv, numeric_outlier_count_list, numeric_row_count_list)
In [74]:
##################################
# Formulating the outlier summary
# for all numeric columns
##################################
numeric_column_outlier_summary = pd.DataFrame(zip(numeric_variable_name_list,
                                                  numeric_skewness_list,
                                                  numeric_outlier_count_list,
                                                  numeric_row_count_list,
                                                  numeric_outlier_ratio_list), 
                                        columns=['Numeric.Column.Name',
                                                 'Skewness',
                                                 'Outlier.Count',
                                                 'Row.Count',
                                                 'Outlier.Ratio'])
display(numeric_column_outlier_summary)
Numeric.Column.Name Skewness Outlier.Count Row.Count Outlier.Ratio
0 AGE -0.395086 2 309 0.006472
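The interquartile range criterion behind Outlier.Count flags any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. A minimal sketch on hypothetical toy values (not the AGE column) uses the standard library's inclusive quantile method, which interpolates linearly between order statistics as the pandas default does. The variable names are illustrative.

```python
import statistics

# Hypothetical values; 40 sits far from the rest
toy_values = [10, 12, 13, 14, 15, 16, 17, 18, 40]

# Quartiles with linear interpolation between order statistics
q1, _, q3 = statistics.quantiles(toy_values, n=4, method='inclusive')
iqr = q3 - q1

# Fences of the 1.5 x IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in toy_values if v < lower_fence or v > upper_fence]
outlier_ratio = len(outliers) / len(toy_values)
print(q1, q3, lower_fence, upper_fence, outliers)
```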
In [75]:
##################################
# Formulating the individual boxplots
# for all numeric columns
##################################
for column in lung_cancer_numeric:
        plt.figure(figsize=(17,1))
        sns.boxplot(data=lung_cancer_numeric, x=column)
[Figure: box plot of AGE]
In [76]:
##################################
# Creating a dataset copy and
# converting all values to numeric (integer or float)
# for correlation analysis
##################################
pd.set_option('future.no_silent_downcasting', True)
lung_cancer_correlation = lung_cancer.copy()
lung_cancer_correlation_object = lung_cancer_correlation.iloc[:,2:15].columns
lung_cancer_correlation[lung_cancer_correlation_object] = lung_cancer_correlation[lung_cancer_correlation_object].replace({'Absent': 0, 'Present': 1})
lung_cancer_correlation = lung_cancer_correlation.drop(['GENDER','LUNG_CANCER'], axis=1)
lung_cancer_correlation['AGE'] = lung_cancer_correlation['AGE'].astype(float)
object_cols_to_convert = lung_cancer_correlation.columns[1:]
for col in object_cols_to_convert:
    if lung_cancer_correlation[col].dtype == 'object':
        lung_cancer_correlation[col] = lung_cancer_correlation[col].astype(int)
display(lung_cancer_correlation)
AGE SMOKING YELLOW_FINGERS ANXIETY PEER_PRESSURE CHRONIC_DISEASE FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SHORTNESS_OF_BREATH SWALLOWING_DIFFICULTY CHEST_PAIN
0 69.0 0 1 1 0 0 1 0 1 1 1 1 1 1
1 74.0 1 0 0 0 1 1 1 0 0 0 1 1 1
2 59.0 0 0 0 1 0 1 0 1 0 1 1 0 1
3 63.0 1 1 1 0 0 0 0 0 1 0 0 1 1
4 63.0 0 1 0 0 0 0 0 1 0 1 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
304 56.0 0 0 0 1 1 1 0 0 1 1 1 1 0
305 70.0 1 0 0 0 0 1 1 1 1 1 1 0 1
306 58.0 1 0 0 0 0 0 1 1 1 1 0 0 1
307 67.0 1 0 1 0 0 1 1 0 1 1 1 0 1
308 62.0 0 0 0 1 0 1 1 1 1 0 0 1 0

309 rows × 14 columns

In [77]:
##################################
# Initializing the correlation matrix
##################################
lung_cancer_correlation_matrix = pd.DataFrame(np.zeros((len(lung_cancer_correlation.columns), len(lung_cancer_correlation.columns))),
                                              columns=lung_cancer_correlation.columns,
                                              index=lung_cancer_correlation.columns)
In [78]:
##################################
# Calculating different types
# of correlation coefficients
# per variable type
##################################
for i in range(len(lung_cancer_correlation.columns)):
    for j in range(i, len(lung_cancer_correlation.columns)):
        if i == j:
            lung_cancer_correlation_matrix.iloc[i, j] = 1.0
        else:
            if lung_cancer_correlation.dtypes.iloc[i] == 'int64' and lung_cancer_correlation.dtypes.iloc[j] == 'int64':
                # Phi coefficient for two binary variables
                corr = lung_cancer_correlation.iloc[:, i].corr(lung_cancer_correlation.iloc[:, j])
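            elif ((lung_cancer_correlation.dtypes.iloc[i] == 'float64' and lung_cancer_correlation.dtypes.iloc[j] == 'int64') or
                  (lung_cancer_correlation.dtypes.iloc[i] == 'int64' and lung_cancer_correlation.dtypes.iloc[j] == 'float64')):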
                # Point-biserial correlation for one continuous and one binary variable
                continuous_var = lung_cancer_correlation.iloc[:, i] if lung_cancer_correlation.dtypes.iloc[i] == 'float64' else lung_cancer_correlation.iloc[:, j]
                binary_var = lung_cancer_correlation.iloc[:, j] if lung_cancer_correlation.dtypes.iloc[j] == 'int64' else lung_cancer_correlation.iloc[:, i]
                corr, _ = pointbiserialr(continuous_var, binary_var)
            else:
                # Pearson correlation for two continuous variables
                corr = lung_cancer_correlation.iloc[:, i].corr(lung_cancer_correlation.iloc[:, j])
            lung_cancer_correlation_matrix.iloc[i, j] = corr
            lung_cancer_correlation_matrix.iloc[j, i] = corr
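The point-biserial coefficient computed in the loop above is numerically identical to the Pearson correlation between the continuous variable and the 0/1 indicator. A minimal sketch on hypothetical toy data compares the textbook point-biserial formula with a direct Pearson computation. The ages, indicator values and function names are illustrative, not taken from the dataset.

```python
import math

def point_biserial(continuous, binary):
    """Point-biserial r: (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), with s_n the population std."""
    n = len(continuous)
    group_1 = [c for c, b in zip(continuous, binary) if b == 1]
    group_0 = [c for c, b in zip(continuous, binary) if b == 0]
    mean_all = sum(continuous) / n
    s_n = math.sqrt(sum((c - mean_all) ** 2 for c in continuous) / n)
    mean_1 = sum(group_1) / len(group_1)
    mean_0 = sum(group_0) / len(group_0)
    return (mean_1 - mean_0) / s_n * math.sqrt(len(group_1) * len(group_0) / n ** 2)

def pearson(x, y):
    """Pearson correlation coefficient computed from first principles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ages and a hypothetical 0/1 symptom indicator
toy_age = [45, 50, 55, 60, 65, 70]
toy_indicator = [0, 0, 0, 1, 1, 1]

r_point_biserial = point_biserial(toy_age, toy_indicator)
r_pearson = pearson(toy_age, toy_indicator)
print(round(r_point_biserial, 4), round(r_pearson, 4))
```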
In [79]:
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric and categorical columns
##################################
plt.figure(figsize=(17, 8))
sns.heatmap(lung_cancer_correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
[Figure: heatmap of pairwise correlation coefficients among AGE and the binary-encoded predictors]

1.5. Data Exploration ¶

1.5.1 Exploratory Data Analysis ¶

  1. The lung cancer prevalence estimated for the overall dataset is 87.38%, indicating class imbalance.
  2. Higher proportions of the Present category in the following categorical predictors are associated with clearer differentiation between LUNG_CANCER=YES and LUNG_CANCER=NO:
    • YELLOW_FINGERS
    • ANXIETY
    • PEER_PRESSURE
    • CHRONIC_DISEASE
    • FATIGUE
    • ALLERGY
    • WHEEZING
    • ALCOHOL_CONSUMING
    • COUGHING
    • SWALLOWING_DIFFICULTY
    • CHEST_PAIN
In [80]:
##################################
# Estimating the lung cancer prevalence
##################################
print('Lung Cancer Prevalence: ')
display(lung_cancer['LUNG_CANCER'].value_counts(normalize = True))
Lung Cancer Prevalence: 
LUNG_CANCER
YES    0.873786
NO     0.126214
Name: proportion, dtype: float64
In [81]:
##################################
# Segregating the target
# and predictor variables
##################################
lung_cancer_predictors = lung_cancer.iloc[:,:-1].columns
lung_cancer_predictors_numeric = lung_cancer.iloc[:,:-1].loc[:,lung_cancer.iloc[:,:-1].columns == 'AGE'].columns
lung_cancer_predictors_categorical = lung_cancer.iloc[:,:-1].loc[:,lung_cancer.iloc[:,:-1].columns != 'AGE'].columns
In [82]:
##################################
# Segregating the target variable
# and numeric predictors
##################################
boxplot_y_variable = 'LUNG_CANCER'
boxplot_x_variable = lung_cancer_predictors_numeric.values[0]
In [83]:
##################################
# Evaluating the numeric predictors
# against the target variable
##################################
plt.figure(figsize=(7, 5))
plt.boxplot([group[boxplot_x_variable] for name, group in lung_cancer.groupby(boxplot_y_variable, observed=True)])
plt.title(f'{boxplot_y_variable} Versus {boxplot_x_variable}')
plt.xlabel(boxplot_y_variable)
plt.ylabel(boxplot_x_variable)
plt.xticks(range(1, len(lung_cancer[boxplot_y_variable].unique()) + 1), ['No', 'Yes'])
plt.show()
[Figure: box plot of AGE by LUNG_CANCER group (No, Yes)]
In [84]:
##################################
# Segregating the target variable
# and categorical predictors
##################################
proportion_y_variables = lung_cancer_predictors_categorical
proportion_x_variable = 'LUNG_CANCER'
In [85]:
##################################
# Defining the number of 
# rows and columns for the subplots
##################################
num_rows = 7
num_cols = 2

##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 40))

##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()

##################################
# Formulating the individual stacked column plots
# for all categorical columns
##################################
for i, y_variable in enumerate(proportion_y_variables):
    ax = axes[i]
    category_counts = lung_cancer.groupby([proportion_x_variable, y_variable], observed=True).size().unstack(fill_value=0)
    category_proportions = category_counts.div(category_counts.sum(axis=1), axis=0)
    category_proportions.plot(kind='bar', stacked=True, ax=ax)
    ax.set_title(f'{proportion_x_variable} Versus {y_variable}')
    ax.set_xlabel(proportion_x_variable)
    ax.set_ylabel('PROPORTIONS')
    ax.legend(loc="lower center")

##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()

##################################
# Presenting the subplots
##################################
plt.show()
[Figure: stacked bar charts of category proportions by LUNG_CANCER group, one panel per categorical predictor]

1.5.2 Hypothesis Testing ¶

  1. The relationship between the numeric predictor and the LUNG_CANCER target variable was statistically evaluated using the following hypotheses:
    • Null: Difference in the means between groups Yes and No is equal to zero
    • Alternative: Difference in the means between groups Yes and No is not equal to zero
  2. There is insufficient evidence to conclude that the mean of the numeric predictor differs between the LUNG_CANCER groups, given its low t-test statistic and a p-value above the significance level of 0.05.
    • AGE: T.Test.Statistic=-1.574, T.Test.PValue=0.116
  3. The relationship between the categorical predictors and the LUNG_CANCER target variable was statistically evaluated using the following hypotheses:
    • Null: The categorical predictor is independent of the target variable
    • Alternative: The categorical predictor is dependent on the target variable
  4. There is sufficient evidence to conclude that a statistically significant relationship exists between the individual categories and the LUNG_CANCER groups for 10 categorical predictors, given their high chi-square statistic values and p-values below the significance level of 0.05.
    • ALLERGY: ChiSquare.Test.Statistic=31.238, ChiSquare.Test.PValue=0.000
    • ALCOHOL_CONSUMING: ChiSquare.Test.Statistic=24.005, ChiSquare.Test.PValue=0.000
    • SWALLOWING_DIFFICULTY: ChiSquare.Test.Statistic=19.307, ChiSquare.Test.PValue=0.000
    • WHEEZING: ChiSquare.Test.Statistic=17.723, ChiSquare.Test.PValue=0.000
    • COUGHING: ChiSquare.Test.Statistic=17.606, ChiSquare.Test.PValue=0.000
    • CHEST_PAIN: ChiSquare.Test.Statistic=10.083, ChiSquare.Test.PValue=0.001
    • PEER_PRESSURE: ChiSquare.Test.Statistic=9.641, ChiSquare.Test.PValue=0.001
    • YELLOW_FINGERS: ChiSquare.Test.Statistic=9.088, ChiSquare.Test.PValue=0.002
    • FATIGUE: ChiSquare.Test.Statistic=6.081, ChiSquare.Test.PValue=0.013
    • ANXIETY: ChiSquare.Test.Statistic=5.648, ChiSquare.Test.PValue=0.017
  5. There is insufficient evidence to conclude that a statistically significant relationship exists between the individual categories and the LUNG_CANCER groups for 4 categorical predictors, given their low chi-square statistic values and p-values above the significance level of 0.05.
    • CHRONIC_DISEASE: ChiSquare.Test.Statistic=3.161, ChiSquare.Test.PValue=0.075
    • GENDER: ChiSquare.Test.Statistic=1.021, ChiSquare.Test.PValue=0.312
    • SHORTNESS_OF_BREATH: ChiSquare.Test.Statistic=0.790, ChiSquare.Test.PValue=0.373
    • SMOKING: ChiSquare.Test.Statistic=0.722, ChiSquare.Test.PValue=0.395
In [86]:
##################################
# Computing the t-test 
# statistic and p-values
# between the target variable
# and numeric predictor columns
##################################
lung_cancer_numeric_ttest_target = {}
lung_cancer_numeric = lung_cancer.loc[:,(lung_cancer.columns == 'AGE') | (lung_cancer.columns == 'LUNG_CANCER')]
lung_cancer_numeric_columns = lung_cancer_predictors_numeric
for numeric_column in lung_cancer_numeric_columns:
    group_0 = lung_cancer_numeric[lung_cancer_numeric.loc[:,'LUNG_CANCER']=='NO']
    group_1 = lung_cancer_numeric[lung_cancer_numeric.loc[:,'LUNG_CANCER']=='YES']
    lung_cancer_numeric_ttest_target['LUNG_CANCER_' + numeric_column] = stats.ttest_ind(
        group_0[numeric_column], 
        group_1[numeric_column], 
        equal_var=True)
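With `equal_var=True`, `stats.ttest_ind` computes Student's pooled-variance t statistic, and its sign follows the argument order (here the NO group minus the YES group). A minimal sketch on hypothetical toy groups shows the arithmetic for the statistic only. The group values and function name are illustrative.

```python
import math

def pooled_t_statistic(sample_a, sample_b):
    """Student's two-sample t statistic with pooled variance (mean of a minus mean of b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = sum(sample_a) / n_a, sum(sample_b) / n_b
    # Sums of squared deviations within each group
    ss_a = sum((v - mean_a) ** 2 for v in sample_a)
    ss_b = sum((v - mean_b) ** 2 for v in sample_b)
    degrees_of_freedom = n_a + n_b - 2
    pooled_variance = (ss_a + ss_b) / degrees_of_freedom
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, degrees_of_freedom

# Hypothetical measurements for the two target groups
toy_group_no = [1, 2, 3]
toy_group_yes = [2, 4, 6]

t_statistic, dof = pooled_t_statistic(toy_group_no, toy_group_yes)
print(round(t_statistic, 4), dof)
```

A negative statistic, as reported for AGE, therefore means that the first group passed (NO) has the lower sample mean.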
In [87]:
##################################
# Formulating the pairwise ttest summary
# between the target variable
# and numeric predictor columns
##################################
lung_cancer_numeric_summary = pd.DataFrame.from_dict(lung_cancer_numeric_ttest_target, orient='index')
lung_cancer_numeric_summary.columns = ['T.Test.Statistic', 'T.Test.PValue']
display(lung_cancer_numeric_summary.sort_values(by=['T.Test.PValue'], ascending=True).head(len(lung_cancer_predictors_numeric)))
T.Test.Statistic T.Test.PValue
LUNG_CANCER_AGE -1.573857 0.11655
In [88]:
##################################
# Computing the chisquare
# statistic and p-values
# between the target variable
# and categorical predictor columns
##################################
lung_cancer_categorical_chisquare_target = {}
lung_cancer_categorical = lung_cancer.loc[:,(lung_cancer.columns != 'AGE') | (lung_cancer.columns == 'LUNG_CANCER')]
lung_cancer_categorical_columns = lung_cancer_predictors_categorical
for categorical_column in lung_cancer_categorical_columns:
    contingency_table = pd.crosstab(lung_cancer_categorical[categorical_column], 
                                    lung_cancer_categorical['LUNG_CANCER'])
    lung_cancer_categorical_chisquare_target['LUNG_CANCER_' + categorical_column] = stats.chi2_contingency(
        contingency_table)[0:2]
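For 2x2 contingency tables such as these, `stats.chi2_contingency` applies Yates' continuity correction by default. A minimal sketch of that computation in plain Python, on a hypothetical table of counts (not taken from the dataset), obtains the p-value for one degree of freedom from the complementary error function. The function name and table are illustrative.

```python
import math

def yates_chi_square_2x2(table):
    """Yates-corrected chi-square statistic and p-value for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * column_totals[j] / total
            # Shrink each |observed - expected| by 0.5, never past zero
            difference = max(abs(table[i][j] - expected) - 0.5, 0.0)
            statistic += difference ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows are predictor categories, columns are target groups
toy_table = [[10, 20],
             [30, 40]]

chi_square, p_value = yates_chi_square_2x2(toy_table)
print(round(chi_square, 4), round(p_value, 3))
```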
In [89]:
##################################
# Formulating the pairwise chisquare summary
# between the target variable
# and categorical predictor columns
##################################
lung_cancer_categorical_summary = pd.DataFrame.from_dict(lung_cancer_categorical_chisquare_target, orient='index')
lung_cancer_categorical_summary.columns = ['ChiSquare.Test.Statistic', 'ChiSquare.Test.PValue']
display(lung_cancer_categorical_summary.sort_values(by=['ChiSquare.Test.PValue'], ascending=True).head(len(lung_cancer_predictors_categorical)))
ChiSquare.Test.Statistic ChiSquare.Test.PValue
LUNG_CANCER_ALLERGY 31.238952 2.281422e-08
LUNG_CANCER_ALCOHOL_CONSUMING 24.005406 9.606559e-07
LUNG_CANCER_SWALLOWING_DIFFICULTY 19.307277 1.112814e-05
LUNG_CANCER_WHEEZING 17.723096 2.555055e-05
LUNG_CANCER_COUGHING 17.606122 2.717123e-05
LUNG_CANCER_CHEST_PAIN 10.083198 1.496275e-03
LUNG_CANCER_PEER_PRESSURE 9.641594 1.902201e-03
LUNG_CANCER_YELLOW_FINGERS 9.088186 2.572659e-03
LUNG_CANCER_FATIGUE 6.081100 1.366356e-02
LUNG_CANCER_ANXIETY 5.648390 1.747141e-02
LUNG_CANCER_CHRONIC_DISEASE 3.161200 7.540772e-02
LUNG_CANCER_GENDER 1.021545 3.121527e-01
LUNG_CANCER_SHORTNESS_OF_BREATH 0.790604 3.739175e-01
LUNG_CANCER_SMOKING 0.722513 3.953209e-01
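The decision rule applied to the table above can be sketched as a simple partition at the 0.05 significance level. The p-values below are copied from the chi-square summary output; the dictionary and variable names are illustrative.

```python
SIGNIFICANCE_LEVEL = 0.05

# Chi-square p-values copied from the summary table above
chi_square_p_values = {
    'ALLERGY': 2.281422e-08,
    'ALCOHOL_CONSUMING': 9.606559e-07,
    'SWALLOWING_DIFFICULTY': 1.112814e-05,
    'WHEEZING': 2.555055e-05,
    'COUGHING': 2.717123e-05,
    'CHEST_PAIN': 1.496275e-03,
    'PEER_PRESSURE': 1.902201e-03,
    'YELLOW_FINGERS': 2.572659e-03,
    'FATIGUE': 1.366356e-02,
    'ANXIETY': 1.747141e-02,
    'CHRONIC_DISEASE': 7.540772e-02,
    'GENDER': 3.121527e-01,
    'SHORTNESS_OF_BREATH': 3.739175e-01,
    'SMOKING': 3.953209e-01,
}

# Reject the null hypothesis of independence when p < 0.05
significant = [name for name, p in chi_square_p_values.items() if p < SIGNIFICANCE_LEVEL]
not_significant = [name for name, p in chi_square_p_values.items() if p >= SIGNIFICANCE_LEVEL]

print(len(significant), 'significant:', significant)
print(len(not_significant), 'not significant:', not_significant)
```

The partition yields 10 significant and 4 non-significant categorical predictors. The latter group, together with AGE from the t-test (p-value 0.11655), matches the predictors excluded in the pre-modelling step.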

1.6. Predictive Model Development ¶

1.6.1 Pre-Modelling Data Preparation ¶

  1. All dichotomous categorical predictors and the target variable were encoded as binary indicators (0/1) for the downstream modelling process.
  2. Predictors found to have insufficient association with the LUNG_CANCER target variable were excluded from the subsequent modelling steps.
    • AGE: T.Test.Statistic=-1.574, T.Test.PValue=0.116
    • CHRONIC_DISEASE: ChiSquare.Test.Statistic=3.161, ChiSquare.Test.PValue=0.075
    • GENDER: ChiSquare.Test.Statistic=1.021, ChiSquare.Test.PValue=0.312
    • SHORTNESS_OF_BREATH: ChiSquare.Test.Statistic=0.790, ChiSquare.Test.PValue=0.373
    • SMOKING: ChiSquare.Test.Statistic=0.722, ChiSquare.Test.PValue=0.395
In [90]:
##################################
# Creating a dataset copy and
# transforming all values to numeric
# prior to data splitting and modelling
##################################
pd.set_option('future.no_silent_downcasting', True)
lung_cancer_transformed = lung_cancer.copy()
lung_cancer_transformed_object = lung_cancer_transformed.iloc[:,2:15].columns
lung_cancer_transformed['GENDER'] = lung_cancer_transformed['GENDER'].astype('category')
lung_cancer_transformed['GENDER'] = lung_cancer_transformed['GENDER'].cat.rename_categories({'F': 0, 'M': 1})
lung_cancer_transformed['LUNG_CANCER'] = lung_cancer_transformed['LUNG_CANCER'].astype('category')
lung_cancer_transformed['LUNG_CANCER'] = lung_cancer_transformed['LUNG_CANCER'].cat.rename_categories({'NO': 0, 'YES': 1})
lung_cancer_transformed[lung_cancer_transformed_object] = lung_cancer_transformed[lung_cancer_transformed_object].replace({'Absent': 0, 'Present': 1})
display(lung_cancer_transformed)
GENDER AGE SMOKING YELLOW_FINGERS ANXIETY PEER_PRESSURE CHRONIC_DISEASE FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SHORTNESS_OF_BREATH SWALLOWING_DIFFICULTY CHEST_PAIN LUNG_CANCER
0 1 69 0 1 1 0 0 1 0 1 1 1 1 1 1 1
1 1 74 1 0 0 0 1 1 1 0 0 0 1 1 1 1
2 0 59 0 0 0 1 0 1 0 1 0 1 1 0 1 0
3 1 63 1 1 1 0 0 0 0 0 1 0 0 1 1 0
4 0 63 0 1 0 0 0 0 0 1 0 1 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
304 0 56 0 0 0 1 1 1 0 0 1 1 1 1 0 1
305 1 70 1 0 0 0 0 1 1 1 1 1 1 0 1 1
306 1 58 1 0 0 0 0 0 1 1 1 1 0 0 1 1
307 1 67 1 0 1 0 0 1 1 0 1 1 1 0 1 1
308 1 62 0 0 0 1 0 1 1 1 1 0 0 1 0 1

309 rows × 16 columns

In [91]:
##################################
# Saving the transformed data
# to the DATASETS_PREPROCESSED_PATH
##################################
lung_cancer_transformed.to_csv(os.path.join("..", DATASETS_PREPROCESSED_PATH, "lung_cancer_transformed.csv"), index=False)
In [92]:
##################################
# Filtering out predictors that did not exhibit 
# sufficient discrimination of the target variable
# Saving the filtered data
# to the DATASETS_FINAL_PATH
##################################
lung_cancer_filtered = lung_cancer_transformed.drop(['GENDER','CHRONIC_DISEASE', 'SHORTNESS_OF_BREATH', 'SMOKING', 'AGE'], axis=1)
lung_cancer_filtered.to_csv(os.path.join("..", DATASETS_FINAL_PATH, "lung_cancer_final.csv"), index=False)
display(lung_cancer_filtered)
YELLOW_FINGERS ANXIETY PEER_PRESSURE FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SWALLOWING_DIFFICULTY CHEST_PAIN LUNG_CANCER
0 1 1 0 1 0 1 1 1 1 1 1
1 0 0 0 1 1 0 0 0 1 1 1
2 0 0 1 1 0 1 0 1 0 1 0
3 1 1 0 0 0 0 1 0 1 1 0
4 1 0 0 0 0 1 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
304 0 0 1 1 0 0 1 1 1 0 1
305 0 0 0 1 1 1 1 1 0 1 1
306 0 0 0 0 1 1 1 1 0 1 1
307 0 1 0 1 1 0 1 1 0 1 1
308 0 0 1 1 1 1 1 0 1 0 1

309 rows × 11 columns

1.6.2 Data Splitting ¶

  1. The preprocessed dataset was divided into three subsets using a fixed random seed:
    • test data: 25% of the original data with class stratification applied
    • train data (initial): 75% of the original data with class stratification applied
      • train data (final): 75% of the train (initial) data with class stratification applied
      • validation data: 25% of the train (initial) data with class stratification applied
  2. Resampling (upsampling and downsampling) algorithms were applied on the train data (final) to evaluate the effects of remedial actions against class imbalance.
  3. Models were developed from the original, upsampled and downsampled train data (final). Using the same datasets, a subset of models with optimal hyperparameters was selected based on cross-validation.
  4. Among candidate models with optimal hyperparameters, the final model was selected based on performance on the validation data.
  5. Performance of the selected final model (and other candidate models for post-model selection comparison) was evaluated using the test data.
  6. The preprocessed data is comprised of:
    • 309 rows (observations)
      • 270 LUNG_CANCER=Yes: 87.38%
      • 39 LUNG_CANCER=No: 12.62%
    • 11 columns (variables)
      • 1/11 target (categorical)
        • LUNG_CANCER
      • 10/11 predictors (categorical)
        • YELLOW_FINGERS
        • ANXIETY
        • PEER_PRESSURE
        • FATIGUE
        • ALLERGY
        • WHEEZING
        • ALCOHOL_CONSUMING
        • COUGHING
        • SWALLOWING_DIFFICULTY
        • CHEST_PAIN
  7. The train data (final) subset is comprised of:
    • 173 rows (observations)
      • 151 LUNG_CANCER=Yes: 87.28%
      • 22 LUNG_CANCER=No: 12.72%
    • 11 columns (variables)
  8. The validation data subset is comprised of:
    • 58 rows (observations)
      • 51 LUNG_CANCER=Yes: 87.93%
      • 7 LUNG_CANCER=No: 12.07%
    • 11 columns (variables)
  9. The train data (final) subset with SMOTE-upsampled minority class (LUNG_CANCER=No) is comprised of:
    • 302 rows (observations)
      • 151 LUNG_CANCER=Yes: 50.00%
      • 151 LUNG_CANCER=No: 50.00%
    • 11 columns (variables)
  10. The train data (final) subset with CNN-downsampled majority class (LUNG_CANCER=Yes) is comprised of:
    • 61 rows (observations)
      • 39 LUNG_CANCER=Yes: 63.93%
      • 22 LUNG_CANCER=No: 36.07%
    • 11 columns (variables)
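The subset sizes listed above follow from applying a 25% hold-out twice. A minimal sketch of that arithmetic, assuming the hold-out row count is rounded up to the next whole row (which is how scikit-learn treats a fractional test_size); the helper name holdout_sizes is hypothetical and not part of the notebook:

```python
import math

def holdout_sizes(n_rows, test_fraction):
    """Return (train_rows, holdout_rows), assuming the hold-out count is rounded up."""
    holdout_rows = math.ceil(n_rows * test_fraction)
    return n_rows - holdout_rows, holdout_rows

# First split: preprocessed data -> train data (initial) and test data
train_initial_rows, test_rows = holdout_sizes(309, 0.25)
# Second split: train data (initial) -> train data (final) and validation data
train_final_rows, validation_rows = holdout_sizes(train_initial_rows, 0.25)

print(train_initial_rows, test_rows)
print(train_final_rows, validation_rows)
# Share of all 309 rows that ends up in each subset
for name, rows in [("train", train_final_rows), ("validation", validation_rows), ("test", test_rows)]:
    print(name, round(rows / 309, 4))
```

Because the second split is taken from the train data (initial) rather than from the full dataset, the train data (final) holds roughly 56% of all rows, not 75%.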
In [93]:
##################################
# Creating a dataset copy
# of the filtered data
##################################
lung_cancer_final = lung_cancer_filtered.copy()
In [94]:
##################################
# Performing a general exploration
# of the final dataset
##################################
print('Final Dataset Dimensions: ')
display(lung_cancer_final.shape)
Final Dataset Dimensions: 
(309, 11)
In [95]:
print('Target Variable Breakdown: ')
lung_cancer_breakdown = lung_cancer_final.groupby('LUNG_CANCER', observed=True).size().reset_index(name='Count')
lung_cancer_breakdown['Percentage'] = (lung_cancer_breakdown['Count'] / len(lung_cancer_final)) * 100
display(lung_cancer_breakdown)
Target Variable Breakdown: 
LUNG_CANCER Count Percentage
0 0 39 12.621359
1 1 270 87.378641
In [96]:
##################################
# Formulating the train and test data
# from the final dataset
# by applying stratification and
# using a 75-25 ratio
##################################
lung_cancer_train_initial, lung_cancer_test = train_test_split(lung_cancer_final, 
                                                               test_size=0.25, 
                                                               stratify=lung_cancer_final['LUNG_CANCER'], 
                                                               random_state=88888888)
In [97]:
##################################
# Performing a general exploration
# of the initial training dataset
##################################
X_train_initial = lung_cancer_train_initial.drop('LUNG_CANCER', axis = 1)
y_train_initial = lung_cancer_train_initial['LUNG_CANCER']
print('Initial Training Dataset Dimensions: ')
display(X_train_initial.shape)
display(y_train_initial.shape)
print('Initial Training Target Variable Breakdown: ')
display(y_train_initial.value_counts())
print('Initial Training Target Variable Proportion: ')
display(y_train_initial.value_counts(normalize = True))
Initial Training Dataset Dimensions: 
(231, 10)
(231,)
Initial Training Target Variable Breakdown: 
LUNG_CANCER
1    202
0     29
Name: count, dtype: int64
Initial Training Target Variable Proportion: 
LUNG_CANCER
1    0.874459
0    0.125541
Name: proportion, dtype: float64
In [98]:
##################################
# Performing a general exploration
# of the test dataset
##################################
X_test = lung_cancer_test.drop('LUNG_CANCER', axis = 1)
y_test = lung_cancer_test['LUNG_CANCER']
print('Test Dataset Dimensions: ')
display(X_test.shape)
display(y_test.shape)
print('Test Target Variable Breakdown: ')
display(y_test.value_counts())
print('Test Target Variable Proportion: ')
display(y_test.value_counts(normalize = True))
Test Dataset Dimensions: 
(78, 10)
(78,)
Test Target Variable Breakdown: 
LUNG_CANCER
1    68
0    10
Name: count, dtype: int64
Test Target Variable Proportion: 
LUNG_CANCER
1    0.871795
0    0.128205
Name: proportion, dtype: float64
In [99]:
##################################
# Formulating the train and validation data
# from the train dataset
# by applying stratification and
# using a 75-25 ratio
##################################
lung_cancer_train, lung_cancer_validation = train_test_split(lung_cancer_train_initial, 
                                                             test_size=0.25, 
                                                             stratify=lung_cancer_train_initial['LUNG_CANCER'], 
                                                             random_state=88888888)
In [100]:
##################################
# Performing a general exploration
# of the final training dataset
##################################
X_train = lung_cancer_train.drop('LUNG_CANCER', axis = 1)
y_train = lung_cancer_train['LUNG_CANCER']
print('Final Training Dataset Dimensions: ')
display(X_train.shape)
display(y_train.shape)
print('Final Training Target Variable Breakdown: ')
display(y_train.value_counts())
print('Final Training Target Variable Proportion: ')
display(y_train.value_counts(normalize = True))
Final Training Dataset Dimensions: 
(173, 10)
(173,)
Final Training Target Variable Breakdown: 
LUNG_CANCER
1    151
0     22
Name: count, dtype: int64
Final Training Target Variable Proportion: 
LUNG_CANCER
1    0.872832
0    0.127168
Name: proportion, dtype: float64
In [101]:
##################################
# Performing a general exploration
# of the validation dataset
##################################
X_validation = lung_cancer_validation.drop('LUNG_CANCER', axis = 1)
y_validation = lung_cancer_validation['LUNG_CANCER']
print('Validation Dataset Dimensions: ')
display(X_validation.shape)
display(y_validation.shape)
print('Validation Target Variable Breakdown: ')
display(y_validation.value_counts())
print('Validation Target Variable Proportion: ')
display(y_validation.value_counts(normalize = True))
Validation Dataset Dimensions: 
(58, 10)
(58,)
Validation Target Variable Breakdown: 
LUNG_CANCER
1    51
0     7
Name: count, dtype: int64
Validation Target Variable Proportion: 
LUNG_CANCER
1    0.87931
0    0.12069
Name: proportion, dtype: float64
In [102]:
##################################
# Initiating an oversampling instance
# on the training data using
# Synthetic Minority Oversampling Technique
##################################
smote = SMOTE(random_state = 88888888)
X_train_smote, y_train_smote = smote.fit_resample(X_train,y_train)
print('SMOTE-Upsampled Training Dataset Dimensions: ')
display(X_train_smote.shape)
display(y_train_smote.shape)
print('SMOTE-Upsampled Training Target Variable Breakdown: ')
display(y_train_smote.value_counts())
print('SMOTE-Upsampled Training Target Variable Proportion: ')
display(y_train_smote.value_counts(normalize = True))
SMOTE-Upsampled Training Dataset Dimensions: 
(302, 10)
(302,)
SMOTE-Upsampled Training Target Variable Breakdown: 
LUNG_CANCER
0    151
1    151
Name: count, dtype: int64
SMOTE-Upsampled Training Target Variable Proportion: 
LUNG_CANCER
0    0.5
1    0.5
Name: proportion, dtype: float64
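To make the upsampling step above more concrete, here is a simplified sketch of the interpolation idea behind SMOTE. It is not the imbalanced-learn implementation: the function smote_like_sample, the toy rows and the neighbour count are all hypothetical and chosen only for illustration.

```python
import random

def smote_like_sample(minority_rows, k=5, rng=None):
    """Create one synthetic row between a minority row and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority_rows)
    # Rank the other minority rows by squared Euclidean distance to the base row
    others = [row for row in minority_rows if row is not base]
    others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation weight in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical minority-class rows with three 0/1 indicator features
minority = [[0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
rng = random.Random(88888888)
synthetic = [smote_like_sample(minority, k=3, rng=rng) for _ in range(4)]
for row in synthetic:
    print([round(value, 2) for value in row])
```

Since each synthetic row lies on the segment between two existing rows, a feature on which the two rows differ takes a value strictly between 0 and 1 whenever the interpolation weight is non-zero. This is worth keeping in mind when the predictors are 0/1 indicators, as they are here.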
In [103]:
##################################
# Initiating an undersampling instance
# on the training data using
# Condensed Nearest Neighbour
##################################
cnn = CondensedNearestNeighbour(random_state = 88888888, n_neighbors=3)
X_train_cnn, y_train_cnn = cnn.fit_resample(X_train,y_train)
print('Downsampled Training Dataset Dimensions: ')
display(X_train_cnn.shape)
display(y_train_cnn.shape)
print('Downsampled Training Target Variable Breakdown: ')
display(y_train_cnn.value_counts())
print('Downsampled Training Target Variable Proportion: ')
display(y_train_cnn.value_counts(normalize = True))
Downsampled Training Dataset Dimensions: 
(61, 10)
(61,)
Downsampled Training Target Variable Breakdown: 
LUNG_CANCER
1    39
0    22
Name: count, dtype: int64
Downsampled Training Target Variable Proportion: 
LUNG_CANCER
1    0.639344
0    0.360656
Name: proportion, dtype: float64
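The downsampling step above keeps every minority row and only a subset of the majority rows. The sketch below illustrates the condensing idea with a single pass and a 1-nearest-neighbour rule. It is a simplification under stated assumptions, not the imbalanced-learn algorithm (the notebook also sets n_neighbors=3), and the helper names and toy data are hypothetical.

```python
import random

def nearest_label(store, row):
    """Label of the stored row closest to `row` (1-nearest neighbour, squared Euclidean distance)."""
    closest = min(store, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], row)))
    return closest[1]

def condense_majority(rows, labels, majority_label, seed=0):
    """Keep all minority rows plus the majority rows that the growing store misclassifies."""
    rng = random.Random(seed)
    minority = [(r, y) for r, y in zip(rows, labels) if y != majority_label]
    majority = [(r, y) for r, y in zip(rows, labels) if y == majority_label]
    rng.shuffle(majority)
    # Seed the store with every minority row and one majority row
    store = minority + [majority[0]]
    for row, label in majority[1:]:
        # A majority row is kept only if the current store gets it wrong
        if nearest_label(store, row) != label:
            store.append((row, label))
    return store

# Hypothetical toy data in which label 1 is the majority class
rows = [[0, 0], [0, 1], [1, 1], [1, 1], [1, 0], [1, 1], [1, 0], [1, 1]]
labels = [0, 0, 1, 1, 1, 1, 1, 1]
condensed = condense_majority(rows, labels, majority_label=1, seed=88888888)
kept_minority = sum(1 for _, y in condensed if y == 0)
kept_majority = sum(1 for _, y in condensed if y == 1)
print(kept_minority, kept_majority)
```

Majority rows that the store already classifies correctly are treated as redundant and dropped, which is why the majority class shrinks while the minority count stays unchanged.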
In [104]:
##################################
# Saving the training data
# to the DATASETS_FINAL_TRAIN_PATH
# and DATASETS_FINAL_TRAIN_FEATURES_PATH
# and DATASETS_FINAL_TRAIN_TARGET_PATH
##################################
lung_cancer_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_PATH, "lung_cancer_train.csv"), index=False)
X_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_FEATURES_PATH, "X_train.csv"), index=False)
y_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_TARGET_PATH, "y_train.csv"), index=False)
X_train_smote.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_FEATURES_PATH, "X_train_smote.csv"), index=False)
y_train_smote.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_TARGET_PATH, "y_train_smote.csv"), index=False)
X_train_cnn.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_FEATURES_PATH, "X_train_cnn.csv"), index=False)
y_train_cnn.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_TARGET_PATH, "y_train_cnn.csv"), index=False)
In [105]:
##################################
# Saving the validation data
# to the DATASETS_FINAL_VALIDATION_PATH
# and DATASETS_FINAL_VALIDATION_FEATURES_PATH
# and DATASETS_FINAL_VALIDATION_TARGET_PATH
##################################
lung_cancer_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_PATH, "lung_cancer_validation.csv"), index=False)
X_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_FEATURES_PATH, "X_validation.csv"), index=False)
y_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_TARGET_PATH, "y_validation.csv"), index=False)
In [106]:
##################################
# Saving the test data
# to the DATASETS_FINAL_TEST_PATH
# and DATASETS_FINAL_TEST_FEATURES_PATH
# and DATASETS_FINAL_TEST_TARGET_PATH
##################################
lung_cancer_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_PATH, "lung_cancer_test.csv"), index=False)
X_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_FEATURES_PATH, "X_test.csv"), index=False)
y_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_TARGET_PATH, "y_test.csv"), index=False)

1.6.3 Modelling Pipeline Development ¶

1.6.3.1 Individual Classifier ¶

Logistic Regression models the probability of an event (one of two outcome levels) by expressing the log-odds of the event as a linear combination of a set of predictors weighted by their respective parameter estimates. The parameters are estimated via maximum likelihood estimation, which iteratively tests different values to find the best fit of the log-odds. For any candidate set of parameters, the conditional probabilities of the observed outcomes are calculated, logged, and summed across observations to yield the log-likelihood, and logistic regression seeks the parameter estimates that maximize this function. Given the optimal parameters, the predicted probability for an observation is obtained by applying the inverse of the log-odds transformation to its linear combination of predictors.
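As a rough illustration of the link between log-odds, predicted probabilities and the log-likelihood, consider the sketch below. The coefficients, observations and function names are hypothetical and are not taken from the fitted model.

```python
import math

def predict_probability(intercept, coefficients, features):
    """Sigmoid of the linear predictor, i.e. the inverse of the log-odds link."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))

def log_likelihood(intercept, coefficients, rows, labels):
    """Sum of the logged probabilities assigned to the observed labels."""
    total = 0.0
    for features, label in zip(rows, labels):
        p = predict_probability(intercept, coefficients, features)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total

# Hypothetical coefficients and observations with two 0/1 predictors
intercept, coefficients = -0.5, [1.2, 0.8]
rows = [[1, 1], [0, 1], [0, 0], [1, 0]]
labels = [1, 1, 0, 1]

baseline_probability = predict_probability(intercept, coefficients, [0, 0])
fit_quality = log_likelihood(intercept, coefficients, rows, labels)
print(round(baseline_probability, 4))
print(round(fit_quality, 4))
```

Maximum likelihood estimation searches for the intercept and coefficients that make the second quantity as large (as close to zero) as possible.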

Class Weights are used to assign different levels of importance to different classes when the distribution of instances across different classes in a classification problem is not equal. By assigning higher weights to the minority class, the model is encouraged to give more attention to correctly predicting instances from the minority class. Class weights are incorporated into the loss function during training. The loss for each instance is multiplied by its corresponding class weight. This means that misclassifying an instance from the minority class will have a greater impact on the overall loss than misclassifying an instance from the majority class. The use of class weights helps balance the influence of each class during training, mitigating the impact of class imbalance. It provides a way to focus the learning process on the classes that are underrepresented in the training data.
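A small sketch of how such weights can be derived and applied, assuming the n_samples / (n_classes * class_count) heuristic that scikit-learn uses for class_weight='balanced'. The class counts are those of the train data (final); the function names and the two example predictions are hypothetical.

```python
import math

def balanced_class_weights(class_counts):
    """Weights from n_samples / (n_classes * class_count), the 'balanced' heuristic."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}

def weighted_row_loss(probability_of_yes, label, weights):
    """Log loss of one row, scaled by the weight of its true class."""
    row_loss = -math.log(probability_of_yes) if label == 1 else -math.log(1.0 - probability_of_yes)
    return weights[label] * row_loss

# Class counts of the train data (final): 1 = LUNG_CANCER=Yes, 0 = LUNG_CANCER=No
weights = balanced_class_weights({1: 151, 0: 22})
print({label: round(weight, 4) for label, weight in weights.items()})

# Two equally wrong hypothetical predictions, one per class
loss_yes_row = weighted_row_loss(0.3, 1, weights)  # true Yes predicted at 0.3
loss_no_row = weighted_row_loss(0.7, 0, weights)   # true No predicted at 0.7
print(round(loss_no_row / loss_yes_row, 2))
```

Both rows have the same unweighted log loss, so the printed ratio equals the ratio of the class weights: an error on the minority class costs several times more than the same error on the majority class.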

Hyperparameter Tuning is an iterative process that involves experimenting with different hyperparameter combinations, evaluating the model's performance, and refining the hyperparameter values to achieve the best possible performance on new, unseen data - aimed at building effective and well-generalizing machine learning models. A model's performance depends not only on the learned parameters (weights) during training but also on hyperparameters, which are external configuration settings that cannot be learned from the data.

  1. A modelling pipeline using an individual classifier was implemented.
    • Logistic regression model from the sklearn.linear_model Python library API with 5 hyperparameters:
      • penalty = penalty norm made to vary between L1, L2 and none
      • class_weight = weights associated with classes held constant at a value equal to balanced or none, as applicable
      • solver = algorithm used in the optimization problem held constant at a value equal to saga
      • max_iter = maximum number of iterations taken for the solvers to converge held constant at a value of 5000
      • random_state = random instance to shuffle the data for the solver algorithm held constant at a value of 88888888
  2. Hyperparameter tuning was conducted using the 5-fold cross-validation method with optimal model performance determined using the F1 score.
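For reference, the F1 score used for tuning can be computed directly from confusion counts. The sketch below assumes that the positive class is the label coded 1 (LUNG_CANCER=Yes), which is the default behaviour of scoring='f1' in scikit-learn. It evaluates a hypothetical classifier that always predicts Yes on the validation class counts (51 Yes, 7 No); the function name is illustrative.

```python
def f1_from_counts(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for the positive class."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Always predicting Yes: all 51 Yes rows are true positives, all 7 No rows are false positives
baseline_f1 = f1_from_counts(true_positives=51, false_positives=7, false_negatives=0)
print(round(baseline_f1, 4))
```

Because the positive class is also the majority class, this trivial baseline already scores highly, so it can serve as a rough floor when reading the F1 values of the candidate models.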
In [107]:
##################################
# Defining the modelling pipeline
# using the logistic regression structure
##################################
individual_pipeline = Pipeline([('individual_model', LogisticRegression(solver='saga', 
                                                             random_state=88888888, 
                                                             max_iter=5000))])
In [108]:
##################################
# Defining the hyperparameters for grid search
# including the regularization penalties
# and class weights for unbalanced class
##################################
individual_unbalanced_class_hyperparameter_grid = {'individual_model__penalty': ['l1', 'l2', None],
                                                   'individual_model__class_weight': ['balanced']}
In [109]:
##################################
# Setting up the GridSearchCV with 5-fold cross-validation
# and using F1 score as the model evaluation metric
##################################
individual_unbalanced_class_grid_search = GridSearchCV(estimator=individual_pipeline,
                                                       param_grid=individual_unbalanced_class_hyperparameter_grid,
                                                       scoring='f1',
                                                       cv=5, 
                                                       n_jobs=-1,
                                                       verbose=1)
In [110]:
##################################
# Defining the hyperparameters for grid search
# including the regularization penalties
# and class weights for balanced class
##################################
individual_balanced_class_hyperparameter_grid = {'individual_model__penalty': ['l1', 'l2', None],
                                                 'individual_model__class_weight': [None]}
In [111]:
##################################
# Setting up the GridSearchCV with 5-fold cross-validation
# and using F1 score as the model evaluation metric
##################################
individual_balanced_class_grid_search = GridSearchCV(estimator=individual_pipeline,
                                                     param_grid=individual_balanced_class_hyperparameter_grid,
                                                     scoring='f1',
                                                     cv=5, 
                                                     n_jobs=-1,
                                                     verbose=1)

1.6.3.2 Stacked Classifier ¶

Logistic Regression models the probability of an event (one of two outcome levels) by expressing the log-odds of the event as a linear combination of a set of predictors weighted by their respective parameter estimates. The parameters are estimated via maximum likelihood estimation, which iteratively tests different values to find the best fit of the log-odds. For any candidate set of parameters, the conditional probabilities of the observed outcomes are calculated, logged, and summed across observations to yield the log-likelihood, and logistic regression seeks the parameter estimates that maximize this function. Given the optimal parameters, the predicted probability for an observation is obtained by applying the inverse of the log-odds transformation to its linear combination of predictors.

Decision Trees create a model that predicts the class label of a sample based on input features. A decision tree consists of nodes that represent decisions or choices, edges that connect nodes and represent the possible outcomes of a decision, and leaf (or terminal) nodes that represent the final decision or predicted class label. The decision-making process involves four components: feature selection (at each internal node, the algorithm decides which feature to split on based on a criterion such as Gini impurity or entropy), splitting criteria (the feature and corresponding threshold that best separate the data into different classes are chosen, with the goal of increasing homogeneity within each resulting subset), recursive splitting (the dataset is partitioned at each internal node based on the chosen feature, and the process repeats for each subset, creating a tree structure) and stopping criteria (the recursion stops when a condition is met, commonly a maximum depth for the tree, a minimum number of samples required to split a node, or a minimum number of samples in a leaf node).
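The entropy criterion can be illustrated with a short calculation. In the sketch below, the parent counts are the class counts of the train data (final), while the child counts describe a hypothetical split on a single 0/1 predictor and are made up for illustration; the function names are likewise illustrative.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    weighted_children = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted_children

# Parent node: [No, Yes] counts of the train data (final)
parent = [22, 151]
# Hypothetical children after splitting on one indicator (counts invented for illustration)
children = [[15, 40], [7, 111]]

parent_entropy = entropy(parent)
gain = information_gain(parent, children)
print(round(parent_entropy, 4), round(gain, 4))
```

At each internal node, the tree compares candidate splits by a quantity like this gain and keeps the split that most reduces impurity in the resulting subsets.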

Random Forest is an ensemble learning method made up of a large set of small decision trees called estimators, each producing its own prediction. The random forest model aggregates the predictions of the estimators to produce a more accurate prediction. The algorithm involves bootstrap aggregating (where subsets of the training data are repeatedly sampled with replacement), random subspacing (where a subset of features is sampled and used to train each individual estimator), estimator training (where an unpruned decision tree is fitted for each estimator) and inference by aggregating the predictions of all estimators.

Support Vector Machine plots each observation in an N-dimensional space, where N is the number of features in the data set, and finds the hyperplane that separates the different classes with the largest possible margin (defined as the distance between the hyperplane and the closest data points from each class). For data that are not linearly separable, the algorithm applies a kernel transformation that maps the observations, using the similarities between points, into a high-dimensional feature space for improved discrimination.

Hyperparameter Tuning is an iterative process that involves experimenting with different hyperparameter combinations, evaluating the model's performance, and refining the hyperparameter values to achieve the best possible performance on new, unseen data - aimed at building effective and well-generalizing machine learning models. A model's performance depends not only on the learned parameters (weights) during training but also on hyperparameters, which are external configuration settings that cannot be learned from the data.

Model Stacking, also known as stacked generalization, is an ensemble approach that involves creating a variety of base learners and using them to create intermediate predictions, one for each learned model. A meta-model is then trained to predict the same target from these intermediate predictions. Unlike bagging, in stacking, the models are typically different (e.g. not all decision trees) and fit on the same dataset (e.g. instead of samples of the training dataset). Unlike boosting, in stacking, a single model is used to learn how to best combine the predictions from the contributing models (e.g. instead of a sequence of models that correct the predictions of prior models). Stacking is appropriate when the predictions made by the base learners, or the errors in those predictions, have minimal correlation. Achieving an improvement in performance depends on the choice of base learners and whether they are sufficiently skillful in their predictions.
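A compact sketch of this arrangement with scikit-learn's StackingClassifier is given below. It uses synthetic data from make_classification as a stand-in for the training set, and the settings (sample size, tree depth, number of trees, variable names) are illustrative assumptions rather than the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced toy data standing in for the real training set
X_toy, y_toy = make_classification(n_samples=200, n_features=10,
                                   weights=[0.2, 0.8], random_state=0)

base_learners = [
    ('dt', DecisionTreeClassifier(max_depth=3, random_state=0)),
    ('rf', RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)),
    ('svm', SVC(kernel='linear', probability=True, random_state=0)),
]

# With cv=5, the meta-learner is fitted on out-of-fold predictions of the base learners
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_toy, y_toy)

# Intermediate predictions seen by the meta-learner, and the final class probabilities
meta_features = stack.transform(X_toy)
probabilities = stack.predict_proba(X_toy)
print(meta_features.shape, probabilities.shape)
```

Fitting the meta-learner on cross-validated predictions, rather than on predictions for rows the base learners were trained on, limits the extent to which it simply rewards base learners that overfit.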

  1. A modelling pipeline using a stacking classifier was implemented.
    • Meta-learner: Logistic regression model from the sklearn.linear_model Python library API with 5 hyperparameters:
      • penalty = penalty norm made to vary between L1, L2 and none
      • class_weight = weights associated with classes held constant at a value equal to balanced or none, as applicable
      • solver = algorithm used in the optimization problem held constant at a value equal to saga
      • max_iter = maximum number of iterations taken for the solvers to converge held constant at 5000
      • random_state = random instance to shuffle the data for the solver algorithm held constant at 88888888
    • Base learner: Decision tree model from the sklearn.tree Python library API with 5 hyperparameters:
      • max_depth = maximum depth of the tree made to vary between 3 and 5
      • class_weight = weights associated with classes held constant at a value equal to balanced or none, as applicable
      • criterion = function to measure the quality of a split held constant at a value equal to entropy
      • min_samples_leaf = minimum number of samples required to be at a leaf node held constant at 3
      • random_state = random instance for feature permutation process of the algorithm held constant at 88888888
    • Base learner: Random forest model from the sklearn.ensemble Python library API with 6 hyperparameters:
      • max_depth = maximum depth of the tree made to vary between 3 and 5
      • class_weight = weights associated with classes held constant at a value equal to balanced or none, as applicable
      • criterion = function to measure the quality of a split held constant at a value equal to entropy
      • max_features = number of features to consider when looking for the best split held constant at a value equal to sqrt
      • min_samples_leaf = minimum number of samples required to be at a leaf node held constant at 3
      • random_state = random instance for controlling the bootstrapping of the samples and feature sampling of the algorithm held constant at 88888888
    • Base learner: Support vector machine model from the sklearn.svm Python library API with 5 hyperparameters:
      • C = inverse of regularization strength made to vary between 0.5 and 1.0
      • class_weight = weights associated with classes held constant at a value equal to balanced or none, as applicable
      • kernel = kernel type to be used in the algorithm held constant at a value equal to linear
      • probability = setting to enable probability estimates held constant at a value equal to true
      • random_state = random instance for controlling data shuffle for probability estimation of the algorithm held constant at 88888888
  2. Hyperparameter tuning was conducted using the 5-fold cross-validation method with optimal model performance determined using the F1 score.
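The size of the resulting search can be worked out by enumerating the grid. The sketch below assumes the parameter names and values described in the list above; in each key, the double underscores separate the pipeline step (stacked_model), the nested estimator (dt, rf, svm or final_estimator) and the hyperparameter itself.

```python
from itertools import product

stacked_grid = {
    'stacked_model__dt__max_depth': [3, 5],
    'stacked_model__rf__max_depth': [3, 5],
    'stacked_model__svm__C': [0.50, 1.00],
    'stacked_model__final_estimator__penalty': ['l1', 'l2', None],
    'stacked_model__final_estimator__class_weight': ['balanced'],
}

# Every combination of values is one candidate configuration
candidates = [dict(zip(stacked_grid, values)) for values in product(*stacked_grid.values())]
n_folds = 5
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
print(candidates[0])
```

Each of these fits trains a full stacking model, which in turn runs its own internal cross-validation over the base learners, so the stacked search is considerably more expensive than the three-candidate search for the individual classifier.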
In [112]:
##################################
# Defining the base learners
# for the stacked classifier
# composed of decision tree,
# random forest, and support vector machine
##################################
stacked_unbalanced_class_base_learners = [('dt', DecisionTreeClassifier(class_weight='balanced',
                                                                         criterion='entropy',
                                                                         min_samples_leaf=3,
                                                                         random_state=88888888)),
                                           ('rf', RandomForestClassifier(class_weight='balanced',
                                                                         criterion='entropy',
                                                                         max_features='sqrt',
                                                                         min_samples_leaf=3,
                                                                         random_state=88888888)),
                                           ('svm', SVC(class_weight='balanced',
                                                       probability=True,
                                                       kernel='linear',
                                                       random_state=88888888))]
In [113]:
##################################
# Defining the meta-learner
# using the logistic regression structure
##################################
stacked_unbalanced_class_meta_learner = LogisticRegression(solver='saga', 
                                                           random_state=88888888,
                                                           max_iter=5000)
In [114]:
##################################
# Defining the stacking model
# using the logistic regression structure
##################################
stacked_unbalanced_class_model = StackingClassifier(estimators=stacked_unbalanced_class_base_learners,
                                                    final_estimator=stacked_unbalanced_class_meta_learner)
In [115]:
##################################
# Defining the modelling pipeline
# for the stacked classifier
# composed of decision tree,
# random forest, and support vector machine
# using the logistic regression structure
##################################
stacked_unbalanced_class_pipeline = Pipeline([('stacked_model', stacked_unbalanced_class_model)])
In [116]:
##################################
# Defining the hyperparameters for grid search
# including the regularization penalties
# and class weights for unbalanced class
##################################
stacked_unbalanced_class_hyperparameter_grid = {'stacked_model__dt__max_depth': [3, 5],
                                                'stacked_model__rf__max_depth': [3, 5],
                                                'stacked_model__svm__C': [0.50, 1.00],
                                                'stacked_model__final_estimator__penalty': ['l1', 'l2', None],
                                                'stacked_model__final_estimator__class_weight': ['balanced']}
In [117]:
##################################
# Setting up the GridSearchCV with 5-fold cross-validation
# and using F1 score as the model evaluation metric
##################################
stacked_unbalanced_class_grid_search = GridSearchCV(estimator=stacked_unbalanced_class_pipeline,
                                                    param_grid=stacked_unbalanced_class_hyperparameter_grid,
                                                    scoring='f1',
                                                    cv=5,
                                                    n_jobs=-1,
                                                    verbose=1)
In [118]:
##################################
# Defining the base learners
# for the stacked classifier
# composed of decision tree,
# random forest, and support vector machine
##################################
stacked_balanced_class_base_learners = [('dt', DecisionTreeClassifier(class_weight=None,
                                                                      criterion='entropy',
                                                                      min_samples_leaf=3,
                                                                      random_state=88888888)),
                                        ('rf', RandomForestClassifier(class_weight=None,
                                                                      criterion='entropy',
                                                                      max_features='sqrt',
                                                                      min_samples_leaf=3,
                                                                      random_state=88888888)),
                                        ('svm', SVC(class_weight=None,
                                                    probability=True,
                                                    kernel='linear',
                                                    random_state=88888888))]
In [119]:
##################################
# Defining the meta-learner
# using the logistic regression structure
##################################
stacked_balanced_class_meta_learner = LogisticRegression(solver='saga',
                                                         random_state=88888888,
                                                         max_iter=5000)
In [120]:
##################################
# Defining the stacking model
# using the logistic regression structure
##################################
stacked_balanced_class_model = StackingClassifier(estimators=stacked_balanced_class_base_learners,
                                                  final_estimator=stacked_balanced_class_meta_learner)
In [121]:
##################################
# Defining the modelling pipeline
# for the stacked classifier
# composed of decision tree,
# random forest, and support vector machine
# using the logistic regression structure
##################################
stacked_balanced_class_pipeline = Pipeline([('stacked_model', stacked_balanced_class_model)])
In [122]:
##################################
# Defining the hyperparameters for grid search
# including the regularization penalties
# and class weights for balanced class
##################################
stacked_balanced_class_hyperparameter_grid = {'stacked_model__dt__max_depth': [3, 5],
                                              'stacked_model__rf__max_depth': [3, 5],
                                              'stacked_model__svm__C': [0.50, 1.00],
                                              'stacked_model__final_estimator__penalty': ['l1', 'l2', None],
                                              'stacked_model__final_estimator__class_weight': [None]}
In [123]:
##################################
# Setting up the GridSearchCV with 5-fold cross-validation
# and using F1 score as the model evaluation metric
##################################
stacked_balanced_class_grid_search = GridSearchCV(estimator=stacked_balanced_class_pipeline,
                                                    param_grid=stacked_balanced_class_hyperparameter_grid,
                                                    scoring='f1',
                                                    cv=5,
                                                    n_jobs=-1,
                                                    verbose=1)

1.6.4 Model Fitting using Original Training Data | Hyperparameter Tuning | Validation ¶

1.6.4.1 Individual Classifier ¶

Logistic Regression models the probability of an event (one of two outcome levels) by expressing the log-odds of the event as a linear combination of a set of predictors weighted by their respective parameter estimates. The parameters are estimated via maximum likelihood estimation: candidate values are updated over multiple iterations to maximize the log-likelihood function, which sums the logged conditional probabilities of the observed outcomes across all observations. Given the optimal parameters, the predicted probability for each observation is obtained by converting its log-odds back to the probability scale.
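
The link between the log-odds and the predicted probability described above can be sketched in a few lines of plain Python. This is a minimal illustration only: the coefficients, intercept and feature values are hypothetical and are not the fitted values from this notebook.

```python
import math

def sigmoid(logit):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def predict_probability(features, coefficients, intercept):
    """Compute the logit as a linear combination, then convert it to a probability."""
    logit = intercept + sum(b * x for b, x in zip(coefficients, features))
    return logit, sigmoid(logit)

# Hypothetical coefficients and binary predictor values (illustration only)
coefficients = [0.8, 1.2, -0.5]
intercept = -0.4
features = [1, 1, 0]

logit, probability = predict_probability(features, coefficients, intercept)
print(f"logit={logit:.2f}, probability={probability:.4f}")
```

A logit of zero corresponds to a probability of 0.50, which is the default classification threshold drawn on the logistic curve plots in this section.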

Class Weights are used to assign different levels of importance to different classes when the distribution of instances across different classes in a classification problem is not equal. By assigning higher weights to the minority class, the model is encouraged to give more attention to correctly predicting instances from the minority class. Class weights are incorporated into the loss function during training. The loss for each instance is multiplied by its corresponding class weight. This means that misclassifying an instance from the minority class will have a greater impact on the overall loss than misclassifying an instance from the majority class. The use of class weights helps balance the influence of each class during training, mitigating the impact of class imbalance. It provides a way to focus the learning process on the classes that are underrepresented in the training data.
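
As a rough sketch of how the weighting works, the snippet below reproduces the heuristic scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count_per_class)`, using the training class counts reported later in this section (22 negatives, 151 positives). The `weighted_log_loss` helper is a hypothetical illustration of scaling a per-instance loss, not the library's internal implementation.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

def weighted_log_loss(y_true, p_positive, class_weights):
    """Log loss for one instance, scaled by the weight of its true class."""
    p_true_class = p_positive if y_true == 1 else 1.0 - p_positive
    return -class_weights[y_true] * math.log(p_true_class)

# Class counts taken from the training classification report (22 vs 151)
labels = [0] * 22 + [1] * 151
class_weights = balanced_class_weights(labels)

# The same size of error (true class given probability 0.30) on each class
minority_error = weighted_log_loss(0, 0.7, class_weights)
majority_error = weighted_log_loss(1, 0.3, class_weights)
print(class_weights, minority_error, majority_error)
```

With these counts the minority class receives a weight close to 3.93 and the majority class a weight close to 0.57, so an equally confident mistake costs several times more on a minority instance.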

Hyperparameter Tuning is an iterative process that involves experimenting with different hyperparameter combinations, evaluating the model's performance, and refining the hyperparameter values to achieve the best possible performance on new, unseen data - aimed at building effective and well-generalizing machine learning models. A model's performance depends not only on the learned parameters (weights) during training but also on hyperparameters, which are external configuration settings that cannot be learned from the data.
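
To make the size of the search concrete, the sketch below expands the two hyperparameter grids used in this section into their candidate combinations. The `expand_grid` helper is a hypothetical stand-in for what an exhaustive grid search enumerates; the grid contents are copied from the `param_grid` values shown in the fitted `GridSearchCV` outputs.

```python
from itertools import product

def expand_grid(grid):
    """Return every combination of hyperparameter values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

individual_grid = {
    'individual_model__penalty': ['l1', 'l2', None],
    'individual_model__class_weight': ['balanced'],
}
stacked_grid = {
    'stacked_model__dt__max_depth': [3, 5],
    'stacked_model__rf__max_depth': [3, 5],
    'stacked_model__svm__C': [0.50, 1.00],
    'stacked_model__final_estimator__penalty': ['l1', 'l2', None],
    'stacked_model__final_estimator__class_weight': ['balanced'],
}

n_folds = 5
for name, grid in [('individual', individual_grid), ('stacked', stacked_grid)]:
    candidates = expand_grid(grid)
    print(f"{name}: {len(candidates)} candidates, {len(candidates) * n_folds} fits")
```

The counts agree with the fitting logs in this section: 3 candidates (15 fits) for the individual classifier and 24 candidates (120 fits) for the stacked classifier.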

  1. The optimal logistic regression model (individual classifier) determined from the 5-fold cross-validation of train data (final) contained the following hyperparameters:
    • penalty = L2
    • class_weight = balanced
    • solver = saga
    • max_iter = 5000
    • random_state = 88888888
  2. The F1 scores estimated for the different data subsets were as follows:
    • train data (final) = 0.9306
    • train data (cross-validated) = 0.9116
    • validation data = 0.9495
  3. Moderate overfitting was noted based on the difference between the apparent (0.9306) and cross-validated (0.9116) F1 scores.
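
The F1 scores listed above can be traced back to confusion-matrix counts. The sketch below is illustrative: the counts are inferred from the rounded classification reports and class supports shown further down in this section, not read directly from the notebook output, so they should be treated as an assumption that happens to reproduce the reported values.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Inferred counts for the positive class (LUNG_CANCER=YES); assumption, see lead-in
train_f1 = f1_from_counts(tp=134, fp=3, fn=17)      # 151 positives, 22 negatives
validation_f1 = f1_from_counts(tp=47, fp=1, fn=4)   # 51 positives, 7 negatives

print(f"train F1 = {train_f1:.4f}")
print(f"validation F1 = {validation_f1:.4f}")
```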
In [124]:
##################################
# Fitting the model on the 
# original training data
##################################
individual_unbalanced_class_grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Out[124]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('individual_model',
                                        LogisticRegression(max_iter=5000,
                                                           random_state=88888888,
                                                           solver='saga'))]),
             n_jobs=-1,
             param_grid={'individual_model__class_weight': ['balanced'],
                         'individual_model__penalty': ['l1', 'l2', None]},
             scoring='f1', verbose=1)
Pipeline(steps=[('individual_model',
                 LogisticRegression(class_weight='balanced', max_iter=5000,
                                    random_state=88888888, solver='saga'))])
LogisticRegression(class_weight='balanced', max_iter=5000,
                   random_state=88888888, solver='saga')
In [125]:
##################################
# Identifying the best model
##################################
individual_unbalanced_class_best_model_original = individual_unbalanced_class_grid_search.best_estimator_
In [126]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
individual_unbalanced_class_best_model_original_f1_cv = individual_unbalanced_class_grid_search.best_score_
individual_unbalanced_class_best_model_original_f1_train = f1_score(y_train, individual_unbalanced_class_best_model_original.predict(X_train))
individual_unbalanced_class_best_model_original_f1_validation = f1_score(y_validation, individual_unbalanced_class_best_model_original.predict(X_validation))
In [127]:
##################################
# Identifying the optimal model
##################################
print('Best Individual Model using the Original Train Data: ')
print(f"Best Individual Model Parameters: {individual_unbalanced_class_grid_search.best_params_}")
Best Individual Model using the Original Train Data: 
Best Individual Model Parameters: {'individual_model__class_weight': 'balanced', 'individual_model__penalty': 'l2'}
In [128]:
##################################
# Summarizing the F1 score results
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {individual_unbalanced_class_best_model_original_f1_cv:.4f}")
print(f"F1 Score on Training Data: {individual_unbalanced_class_best_model_original_f1_train:.4f}")
print("\nClassification Report on Training Data:\n", classification_report(y_train, individual_unbalanced_class_best_model_original.predict(X_train)))
F1 Score on Cross-Validated Data: 0.9116
F1 Score on Training Data: 0.9306

Classification Report on Training Data:
               precision    recall  f1-score   support

           0       0.53      0.86      0.66        22
           1       0.98      0.89      0.93       151

    accuracy                           0.88       173
   macro avg       0.75      0.88      0.79       173
weighted avg       0.92      0.88      0.90       173

In [129]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the training data
##################################
cm_raw = confusion_matrix(y_train, individual_unbalanced_class_best_model_original.predict(X_train))
cm_normalized = confusion_matrix(y_train, individual_unbalanced_class_best_model_original.predict(X_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Individual Model on Training Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Individual Model on Training Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw-count and normalized confusion matrices for the best individual model on the training data]
In [130]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
##################################
print(f"F1 Score on Validation Data: {individual_unbalanced_class_best_model_original_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation, individual_unbalanced_class_best_model_original.predict(X_validation)))
F1 Score on Validation Data: 0.9495

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.60      0.86      0.71         7
           1       0.98      0.92      0.95        51

    accuracy                           0.91        58
   macro avg       0.79      0.89      0.83        58
weighted avg       0.93      0.91      0.92        58

In [131]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation, individual_unbalanced_class_best_model_original.predict(X_validation))
cm_normalized = confusion_matrix(y_validation, individual_unbalanced_class_best_model_original.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Individual Model on Validation Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Individual Model on Validation Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw-count and normalized confusion matrices for the best individual model on the validation data]
In [132]:
##################################
# Obtaining the logit values (log-odds)
# from the decision function for training data
##################################
individual_unbalanced_class_best_model_original_logit_values = individual_unbalanced_class_best_model_original.decision_function(X_train)
In [133]:
##################################
# Obtaining the estimated probabilities 
# for the positive class (LUNG_CANCER=YES) for training data
##################################
individual_unbalanced_class_best_model_original_probabilities = individual_unbalanced_class_best_model_original.predict_proba(X_train)[:, 1]
In [134]:
##################################
# Sorting the values to generate
# a smoother curve
##################################
individual_unbalanced_class_best_model_original_sorted_indices = np.argsort(individual_unbalanced_class_best_model_original_logit_values)
individual_unbalanced_class_best_model_original_logit_values_sorted = individual_unbalanced_class_best_model_original_logit_values[individual_unbalanced_class_best_model_original_sorted_indices]
individual_unbalanced_class_best_model_original_probabilities_sorted = individual_unbalanced_class_best_model_original_probabilities[individual_unbalanced_class_best_model_original_sorted_indices]
In [135]:
##################################
# Plotting the estimated logistic curve
# using the logit values
# and estimated probabilities
# obtained from the training data
##################################
plt.figure(figsize=(17, 8))
plt.plot(individual_unbalanced_class_best_model_original_logit_values_sorted, 
         individual_unbalanced_class_best_model_original_probabilities_sorted, label='Classification Model Logistic Curve', color='black')
plt.ylim(-0.05, 1.05)
plt.xlim(-8.00, 8.00)
target_0_indices = y_train == 0
target_1_indices = y_train == 1
plt.scatter(individual_unbalanced_class_best_model_original_logit_values[target_0_indices], 
            individual_unbalanced_class_best_model_original_probabilities[target_0_indices], 
            color='blue', alpha=0.40, s=100, marker= 'o', edgecolor='k', label='LUNG_CANCER=NO')
plt.scatter(individual_unbalanced_class_best_model_original_logit_values[target_1_indices], 
            individual_unbalanced_class_best_model_original_probabilities[target_1_indices], 
            color='red', alpha=0.40, s=100, marker='o', edgecolor='k', label='LUNG_CANCER=YES')
plt.axhline(0.5, color='green', linestyle='--', label='Classification Model Threshold')
plt.title('Logistic Curve (Original Training Data): Individual Model')
plt.xlabel('Logit (Log-Odds)')
plt.ylabel('Estimated Lung Cancer Probability')
plt.grid(True)
plt.legend(loc='upper left')
plt.show()
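[Figure: logistic curve for the individual model on the original training data, plotting estimated lung cancer probability against the logit (log-odds)]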
In [136]:
##################################
# Saving the best individual model
# developed from the original training data
################################## 
joblib.dump(individual_unbalanced_class_best_model_original, 
            os.path.join("..", MODELS_PATH, "individual_unbalanced_class_best_model_original.pkl"))
Out[136]:
['..\\models\\individual_unbalanced_class_best_model_original.pkl']

1.6.4.2 Stacked Classifier ¶

Decision Trees create a model that predicts the class label of a sample based on input features. A decision tree consists of internal nodes that represent decisions, edges that connect nodes and represent the possible outcomes of a decision, and leaf (or terminal) nodes that represent the predicted class label. The decision-making process involves four steps: feature selection (at each internal node, the algorithm decides which feature to split on based on a criterion such as Gini impurity or entropy), splitting (the feature and corresponding threshold that best separate the data into different classes are chosen, with the goal of increasing homogeneity within each resulting subset), recursive splitting (the dataset is partitioned at each internal node based on the chosen feature, and the process repeats for each subset to build the tree structure) and stopping (the recursion ends when a stopping criterion is met, such as a maximum depth for the tree, a minimum number of samples required to split a node, or a minimum number of samples in a leaf node).
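
Since the base decision tree in this section uses the entropy criterion, a small sketch of how a candidate split is scored may help. The functions below compute Shannon entropy and the information gain of a binary split; the label lists are hypothetical toy data, not rows from the lung cancer dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting parent into left and right."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

# Hypothetical node with a 50/50 class mix and two candidate splits
parent = [1, 1, 1, 1, 0, 0, 0, 0]
weak_gain = information_gain(parent, [1, 1, 1, 0], [1, 0, 0, 0])
perfect_gain = information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])
print(f"weak split gain = {weak_gain:.4f}, perfect split gain = {perfect_gain:.4f}")
```

The tree-growing procedure evaluates gains like these for each candidate feature and threshold, keeps the best one, and recurses until a stopping criterion such as `max_depth` or `min_samples_leaf` is reached.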

Random Forest is an ensemble learning method made up of a large set of decision trees called estimators, each producing its own prediction. The random forest model aggregates the predictions of the estimators to produce a more accurate prediction. The algorithm involves bootstrap aggregating (where samples of the training data are repeatedly drawn with replacement), random subspacing (where a random subset of features is sampled for use in training each estimator), estimator training (where a decision tree is fitted for each estimator) and inference by aggregating the predictions of all estimators.
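
The three ingredients named above (bootstrap sampling, random feature subsets and aggregation) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the feature count of 16 is hypothetical, the feature subset is drawn once per estimator as the paragraph describes, and no trees are actually fitted.

```python
import math
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices at random."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Aggregate class predictions from all estimators by plurality."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(88888888)
rows = list(range(10))            # stand-in for training row indices
n_features = 16                   # hypothetical feature count
k = math.isqrt(n_features)        # mirrors max_features='sqrt'

sample = bootstrap_sample(rows, rng)
feature_subset = random_feature_subset(n_features, k, rng)
ensemble_prediction = majority_vote([1, 1, 0])
print(sample, feature_subset, ensemble_prediction)
```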

Support Vector Machine represents each observation as a point in an N-dimensional space, where N is the number of features in the data set, and finds the hyperplane that separates the different classes with the largest possible margin (which is defined as the distance between the hyperplane and the closest data points from each class). When the classes are not linearly separable, the algorithm applies a kernel transformation that uses the similarities between the points to map the data into a high-dimensional feature space for improved discrimination.
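
For the linear kernel used in this section, the hyperplane and margin have a simple geometric form. The sketch below assumes the common convention in which the closest points satisfy |w·x + b| = 1, so the full margin width is 2/||w||; the weight vector, bias and points are hypothetical values chosen for easy arithmetic.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

def margin_width(w):
    """Width of the margin between the two supporting hyperplanes."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane and points (illustration only)
w, b = [3.0, 4.0], -5.0
support_vector = [1.2, 0.6]   # lies on the margin boundary: w.x + b = 1
far_point = [3.0, 4.0]

print(margin_width(w))
print(distance_to_hyperplane(w, b, support_vector))
print(distance_to_hyperplane(w, b, far_point))
```

A smaller `C` (such as the 0.50 in the search grid) tolerates more margin violations in exchange for a wider margin, while a larger `C` penalizes violations more heavily.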

Model Stacking, also known as stacked generalization, is an ensemble approach that involves creating a variety of base learners and using them to create intermediate predictions, one for each learned model. A meta-model is then trained to predict the same target from these intermediate predictions. Unlike bagging, in stacking, the models are typically different (e.g. not all decision trees) and are fit on the same dataset (rather than on samples of the training dataset). Unlike boosting, in stacking, a single model is used to learn how to best combine the predictions from the contributing models (rather than a sequence of models that correct the predictions of prior models). Stacking is appropriate when the predictions made by the base learners, or the errors in those predictions, have minimal correlation. Achieving an improvement in performance depends on the choice of base learners and whether they are sufficiently skillful in their predictions.
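
The inference path of a stacked model can be sketched as follows: each base learner contributes one probability per observation, these are arranged as columns of a meta-feature matrix, and the logistic regression meta-learner combines them. All numbers below (base probabilities, meta-learner coefficients and intercept) are hypothetical. In scikit-learn's `StackingClassifier`, the meta-learner is trained on cross-validated predictions of the base learners, a step this sketch does not reproduce.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def stack_meta_features(base_probabilities):
    """Arrange per-learner probabilities into one row per observation."""
    return [list(row) for row in zip(*base_probabilities.values())]

def meta_learner_predict(meta_features, coefficients, intercept):
    """Combine base-learner outputs with a logistic regression meta-learner."""
    return [sigmoid(intercept + sum(c * p for c, p in zip(coefficients, row)))
            for row in meta_features]

# Hypothetical positive-class probabilities for two observations
base_probabilities = {
    'dt':  [0.9, 0.2],
    'rf':  [0.8, 0.3],
    'svm': [0.7, 0.4],
}
# Hypothetical meta-learner parameters (illustration only)
coefficients, intercept = [2.0, 2.5, 1.5], -3.0

meta_features = stack_meta_features(base_probabilities)
stacked_probabilities = meta_learner_predict(meta_features, coefficients, intercept)
print(meta_features)
print(stacked_probabilities)
```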

  1. The optimal decision tree model (base learner) determined from the 5-fold cross-validation of train data (final) contained the following hyperparameters:
    • max_depth = 3
    • class_weight = balanced
    • criterion = entropy
    • min_samples_leaf = 3
    • random_state = 88888888
  2. The optimal random forest model (base learner) determined from the 5-fold cross-validation of train data (final) contained the following hyperparameters:
    • max_depth = 5
    • class_weight = balanced
    • criterion = entropy
    • max_features = sqrt
    • min_samples_leaf = 3
    • random_state = 88888888
  3. The optimal support vector machine model (base learner) determined from the 5-fold cross-validation of train data (final) contained the following hyperparameters:
    • C = 0.50
    • class_weight = balanced
    • kernel = linear
    • probability = True
    • random_state = 88888888
  4. The optimal logistic regression model (meta-learner) determined from the 5-fold cross-validation of train data (final) contained the following hyperparameters:
    • penalty = L1
    • class_weight = balanced
    • solver = saga
    • max_iter = 5000
    • random_state = 88888888
  5. The F1 scores estimated for the different data subsets were as follows:
    • train data (final) = 0.9404
    • train data (cross-validated) = 0.9125
    • validation data = 0.9149
  6. Moderate overfitting was noted based on the difference between the apparent (0.9404) and cross-validated (0.9125) F1 scores.
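
One way to put the overfitting remarks for the two classifiers side by side is to compute the gap between the apparent and cross-validated F1 scores. The sketch below uses only the F1 values reported in this section; the `optimism_gap` helper name is an illustrative choice, not something defined in the notebook.

```python
# F1 scores as reported for the original training data
f1_scores = {
    'individual': {'train': 0.9306, 'cv': 0.9116, 'validation': 0.9495},
    'stacked':    {'train': 0.9404, 'cv': 0.9125, 'validation': 0.9149},
}

def optimism_gap(scores):
    """Difference between the apparent (train) and cross-validated F1 scores."""
    return scores['train'] - scores['cv']

gaps = {name: optimism_gap(scores) for name, scores in f1_scores.items()}
for name, gap in gaps.items():
    print(f"{name}: train - cv = {gap:.4f}, validation = {f1_scores[name]['validation']:.4f}")
```

On these figures the stacked classifier shows the larger gap (about 0.028 versus 0.019) and the lower validation F1 score.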
In [137]:
##################################
# Fitting the model on the 
# original training data
##################################
stacked_unbalanced_class_grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Out[137]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('stacked_model',
                                        StackingClassifier(estimators=[('dt',
                                                                        DecisionTreeClassifier(class_weight='balanced',
                                                                                               criterion='entropy',
                                                                                               min_samples_leaf=3,
                                                                                               random_state=88888888)),
                                                                       ('rf',
                                                                        RandomForestClassifier(class_weight='balanced',
                                                                                               criterion='entropy',
                                                                                               min_samples_leaf=3,
                                                                                               random_state=88888888)),
                                                                       ('svm',
                                                                        SVC(class_weight='b...
                                                           final_estimator=LogisticRegression(max_iter=5000,
                                                                                              random_state=88888888,
                                                                                              solver='saga')))]),
             n_jobs=-1,
             param_grid={'stacked_model__dt__max_depth': [3, 5],
                         'stacked_model__final_estimator__class_weight': ['balanced'],
                         'stacked_model__final_estimator__penalty': ['l1', 'l2',
                                                                     None],
                         'stacked_model__rf__max_depth': [3, 5],
                         'stacked_model__svm__C': [0.5, 1.0]},
             scoring='f1', verbose=1)
Pipeline(steps=[('stacked_model',
                 StackingClassifier(estimators=[('dt',
                                                 DecisionTreeClassifier(class_weight='balanced',
                                                                        criterion='entropy',
                                                                        max_depth=3,
                                                                        min_samples_leaf=3,
                                                                        random_state=88888888)),
                                                ('rf',
                                                 RandomForestClassifier(class_weight='balanced',
                                                                        criterion='entropy',
                                                                        max_depth=5,
                                                                        min_samples_leaf=3,
                                                                        random_state=88888888)),
                                                ('svm',
                                                 SVC(C=0.5,
                                                     class_weight='balanced',
                                                     kernel='linear',
                                                     probability=True,
                                                     random_state=88888888))],
                                    final_estimator=LogisticRegression(class_weight='balanced',
                                                                       max_iter=5000,
                                                                       penalty='l1',
                                                                       random_state=88888888,
                                                                       solver='saga')))])
StackingClassifier(estimators=[('dt',
                                DecisionTreeClassifier(class_weight='balanced',
                                                       criterion='entropy',
                                                       max_depth=3,
                                                       min_samples_leaf=3,
                                                       random_state=88888888)),
                               ('rf',
                                RandomForestClassifier(class_weight='balanced',
                                                       criterion='entropy',
                                                       max_depth=5,
                                                       min_samples_leaf=3,
                                                       random_state=88888888)),
                               ('svm',
                                SVC(C=0.5, class_weight='balanced',
                                    kernel='linear', probability=True,
                                    random_state=88888888))],
                   final_estimator=LogisticRegression(class_weight='balanced',
                                                      max_iter=5000,
                                                      penalty='l1',
                                                      random_state=88888888,
                                                      solver='saga'))
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, min_samples_leaf=3, random_state=88888888)
RandomForestClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=5, min_samples_leaf=3, random_state=88888888)
SVC(C=0.5, class_weight='balanced', kernel='linear', probability=True,
    random_state=88888888)
LogisticRegression(class_weight='balanced', max_iter=5000, penalty='l1',
                   random_state=88888888, solver='saga')
In [138]:
##################################
# Identifying the best model
##################################
stacked_unbalanced_class_best_model_original = stacked_unbalanced_class_grid_search.best_estimator_
In [139]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_unbalanced_class_best_model_original_f1_cv = stacked_unbalanced_class_grid_search.best_score_
stacked_unbalanced_class_best_model_original_f1_train = f1_score(y_train, stacked_unbalanced_class_best_model_original.predict(X_train))
stacked_unbalanced_class_best_model_original_f1_validation = f1_score(y_validation, stacked_unbalanced_class_best_model_original.predict(X_validation))
In [140]:
##################################
# Identifying the optimal model
##################################
print('Best Stacked Model using the Original Train Data: ')
print(f"Best Stacked Model Parameters: {stacked_unbalanced_class_grid_search.best_params_}")
Best Stacked Model using the Original Train Data: 
Best Stacked Model Parameters: {'stacked_model__dt__max_depth': 3, 'stacked_model__final_estimator__class_weight': 'balanced', 'stacked_model__final_estimator__penalty': 'l1', 'stacked_model__rf__max_depth': 5, 'stacked_model__svm__C': 0.5}
In [141]:
##################################
# Summarizing the F1 score results
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_unbalanced_class_best_model_original_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_unbalanced_class_best_model_original_f1_train:.4f}")
print("\nClassification Report on Training Data:\n", classification_report(y_train, stacked_unbalanced_class_best_model_original.predict(X_train)))
F1 Score on Cross-Validated Data: 0.9125
F1 Score on Training Data: 0.9404

Classification Report on Training Data:
               precision    recall  f1-score   support

           0       0.56      1.00      0.72        22
           1       1.00      0.89      0.94       151

    accuracy                           0.90       173
   macro avg       0.78      0.94      0.83       173
weighted avg       0.94      0.90      0.91       173

In [142]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the training data
##################################
cm_raw = confusion_matrix(y_train, stacked_unbalanced_class_best_model_original.predict(X_train))
cm_normalized = confusion_matrix(y_train, stacked_unbalanced_class_best_model_original.predict(X_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Stacked Model on Training Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Stacked Model on Training Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
In [143]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
##################################
print(f"F1 Score on Validation Data: {stacked_unbalanced_class_best_model_original_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation, stacked_unbalanced_class_best_model_original.predict(X_validation)))
F1 Score on Validation Data: 0.9149

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.47      1.00      0.64         7
           1       1.00      0.84      0.91        51

    accuracy                           0.86        58
   macro avg       0.73      0.92      0.78        58
weighted avg       0.94      0.86      0.88        58

In [144]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation, stacked_unbalanced_class_best_model_original.predict(X_validation))
cm_normalized = confusion_matrix(y_validation, stacked_unbalanced_class_best_model_original.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Stacked Model on Validation Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Stacked Model on Validation Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
In [145]:
##################################
# Obtaining the logit values (log-odds)
# from the decision function for training data
##################################
stacked_unbalanced_class_best_model_original_logit_values = stacked_unbalanced_class_best_model_original.decision_function(X_train)
In [146]:
##################################
# Obtaining the estimated probabilities 
# for the positive class (LUNG_CANCER=YES) for training data
##################################
stacked_unbalanced_class_best_model_original_probabilities = stacked_unbalanced_class_best_model_original.predict_proba(X_train)[:, 1]
In [147]:
##################################
# Sorting the values to generate
# a smoother curve
##################################
stacked_unbalanced_class_best_model_original_sorted_indices = np.argsort(stacked_unbalanced_class_best_model_original_logit_values)
stacked_unbalanced_class_best_model_original_logit_values_sorted = stacked_unbalanced_class_best_model_original_logit_values[stacked_unbalanced_class_best_model_original_sorted_indices]
stacked_unbalanced_class_best_model_original_probabilities_sorted = stacked_unbalanced_class_best_model_original_probabilities[stacked_unbalanced_class_best_model_original_sorted_indices]
In [148]:
##################################
# Plotting the estimated logistic curve
# using the logit values
# and estimated probabilities
# obtained from the training data
##################################
plt.figure(figsize=(17, 8))
plt.plot(stacked_unbalanced_class_best_model_original_logit_values_sorted, 
         stacked_unbalanced_class_best_model_original_probabilities_sorted, label='Classification Model Logistic Curve', color='black')
plt.ylim(-0.05, 1.05)
plt.xlim(-8.00, 8.00)
target_0_indices = y_train == 0
target_1_indices = y_train == 1
plt.scatter(stacked_unbalanced_class_best_model_original_logit_values[target_0_indices], 
            stacked_unbalanced_class_best_model_original_probabilities[target_0_indices], 
            color='blue', alpha=0.40, s=100, marker= 'o', edgecolor='k', label='LUNG_CANCER=NO')
plt.scatter(stacked_unbalanced_class_best_model_original_logit_values[target_1_indices], 
            stacked_unbalanced_class_best_model_original_probabilities[target_1_indices], 
            color='red', alpha=0.40, s=100, marker='o', edgecolor='k', label='LUNG_CANCER=YES')
plt.axhline(0.5, color='green', linestyle='--', label='Classification Model Threshold')
plt.title('Logistic Curve (Original Training Data): Stacked Model')
plt.xlabel('Logit (Log-Odds)')
plt.ylabel('Estimated Lung Cancer Probability')
plt.grid(True)
plt.legend(loc='upper left')
plt.show()
In [149]:
##################################
# Saving the best stacked model
# developed from the original training data
################################## 
joblib.dump(stacked_unbalanced_class_best_model_original, 
            os.path.join("..", MODELS_PATH, "stacked_unbalanced_class_best_model_original.pkl"))
Out[149]:
['..\\models\\stacked_unbalanced_class_best_model_original.pkl']

1.6.5 Model Fitting using Upsampled Training Data | Hyperparameter Tuning | Validation ¶

1.6.5.1 Individual Classifier ¶

Synthetic Minority Oversampling Technique (SMOTE) increases the representation of the minority class by generating new minority instances between existing ones. The new instances are not copies of existing minority cases. Instead, for each minority class instance, the algorithm creates synthetic examples by interpolating between the feature vector of that instance and the feature vectors of its k nearest minority-class neighbors, so that each synthetic sample lies on the line segment connecting the original instance to one of its neighbors.
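
The interpolation step can be sketched in a few lines of plain Python. This is only an illustration of how one synthetic point is placed, not the implementation used to build `X_train_smote`; the function `smote_like_sample` and both feature vectors are hypothetical.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and one neighbor."""
    lam = rng.random()  # interpolation weight drawn from [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(88888888)
minority_point = [1.0, 0.0, 1.0]    # hypothetical minority-class feature vector
nearest_neighbor = [0.0, 1.0, 1.0]  # hypothetical neighbor from the same class

synthetic = smote_like_sample(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Features on which the two points agree are copied unchanged, while features on which they differ take an intermediate value, so interpolated values of binary-coded predictors are generally fractional.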

Logistic Regression models the probability of an event (one of two outcome levels) by expressing the log-odds of the event as a linear combination of a set of predictors weighted by their respective parameter estimates. The parameters are estimated via maximum likelihood: an iterative optimizer tries successive parameter values and keeps those that increase the log-likelihood, which is obtained by calculating the conditional probability of each observed outcome under the current parameters, taking its logarithm, and summing over all observations. Given the optimal parameters, the predicted probability for an observation is obtained by applying the logistic function to its log-odds.
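
A minimal sketch of the quantity that maximum likelihood estimation maximizes is given below, assuming a single hypothetical predictor and two hand-picked parameter settings. The helper names and the toy data are illustrative only and do not come from the notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(intercept, coefs, X, y):
    """Sum of the log-probabilities assigned to the observed outcomes."""
    total = 0.0
    for row, label in zip(X, y):
        logit = intercept + sum(c * v for c, v in zip(coefs, row))
        p = sigmoid(logit)
        total += label * math.log(p) + (1 - label) * math.log(1 - p)
    return total

X = [[0.0], [1.0], [2.0], [3.0]]  # hypothetical single predictor
y = [0, 0, 1, 1]                  # hypothetical binary outcome

ll_flat = log_likelihood(0.0, [0.0], X, y)     # every probability equals 0.5
ll_better = log_likelihood(-3.0, [2.0], X, y)  # probabilities track the outcome

print(ll_flat, ll_better)
```

The second parameter setting yields the higher log-likelihood, which is the direction an optimizer such as `saga` moves in, subject to any penalty term.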

Hyperparameter Tuning is an iterative process that involves experimenting with different hyperparameter combinations, evaluating the model's performance for each, and refining the hyperparameter values to achieve the best possible performance on new, unseen data. A model's performance depends not only on the parameters (weights) learned during training but also on hyperparameters, which are configuration settings fixed before training rather than estimated by the fitting procedure.
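
The fit counts reported by `GridSearchCV` in this section follow directly from the size of each grid. The sketch below copies the two `param_grid` dictionaries from the search objects shown in this section and enumerates the candidates with `itertools.product`; `count_fits` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import product

individual_grid = {
    "individual_model__class_weight": [None],
    "individual_model__penalty": ["l1", "l2", None],
}
stacked_grid = {
    "stacked_model__dt__max_depth": [3, 5],
    "stacked_model__final_estimator__class_weight": [None],
    "stacked_model__final_estimator__penalty": ["l1", "l2", None],
    "stacked_model__rf__max_depth": [3, 5],
    "stacked_model__svm__C": [0.5, 1.0],
}

def count_fits(param_grid, n_folds=5):
    """Return (number of candidates, number of cross-validation fits)."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * n_folds

individual_counts = count_fits(individual_grid)
stacked_counts = count_fits(stacked_grid)
print(individual_counts, stacked_counts)
```

These counts cover only the cross-validation fits. With the default `refit=True`, the selected candidate is fitted once more on the full training data to produce `best_estimator_`.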

  1. The optimal logistic regression model (individual classifier) determined from the 5-fold cross-validation of train data (SMOTE-upsampled) contained the following hyperparameters:
    • penalty = L1
    • class_weight = none
    • solver = saga
    • max_iter = 5000
    • random_state = 88888888
  2. The F1 scores estimated for the different data subsets were as follows:
    • train data (SMOTE-upsampled) = 0.9122
    • train data (cross-validated) = 0.9109
    • validation data = 0.9278
  3. Minimal overfitting was noted, based on the small difference (0.0013) between the apparent and cross-validated F1 scores.
In [150]:
##################################
# Fitting the model on the 
# upsampled training data
##################################
individual_balanced_class_grid_search.fit(X_train_smote, y_train_smote)
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Out[150]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('individual_model',
                                        LogisticRegression(max_iter=5000,
                                                           random_state=88888888,
                                                           solver='saga'))]),
             n_jobs=-1,
             param_grid={'individual_model__class_weight': [None],
                         'individual_model__penalty': ['l1', 'l2', None]},
             scoring='f1', verbose=1)
Pipeline(steps=[('individual_model',
                 LogisticRegression(max_iter=5000, penalty='l1',
                                    random_state=88888888, solver='saga'))])
LogisticRegression(max_iter=5000, penalty='l1', random_state=88888888,
                   solver='saga')
In [151]:
##################################
# Identifying the best model
##################################
individual_balanced_class_best_model_upsampled = individual_balanced_class_grid_search.best_estimator_
In [152]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
individual_balanced_class_best_model_upsampled_f1_cv = individual_balanced_class_grid_search.best_score_
individual_balanced_class_best_model_upsampled_f1_train_smote = f1_score(y_train_smote, individual_balanced_class_best_model_upsampled.predict(X_train_smote))
individual_balanced_class_best_model_upsampled_f1_validation = f1_score(y_validation, individual_balanced_class_best_model_upsampled.predict(X_validation))
In [153]:
##################################
# Identifying the optimal model
##################################
print('Best Individual Model using the SMOTE-Upsampled Train Data: ')
print(f"Best Individual Model Parameters: {individual_balanced_class_grid_search.best_params_}")
Best Individual Model using the SMOTE-Upsampled Train Data: 
Best Individual Model Parameters: {'individual_model__class_weight': None, 'individual_model__penalty': 'l1'}
In [154]:
##################################
# Summarizing the F1 score results
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {individual_balanced_class_best_model_upsampled_f1_cv:.4f}")
print(f"F1 Score on Training Data: {individual_balanced_class_best_model_upsampled_f1_train_smote:.4f}")
print("\nClassification Report on Training Data:\n", classification_report(y_train_smote, individual_balanced_class_best_model_upsampled.predict(X_train_smote)))
F1 Score on Cross-Validated Data: 0.9109
F1 Score on Training Data: 0.9122

Classification Report on Training Data:
               precision    recall  f1-score   support

           0       0.90      0.93      0.92       151
           1       0.93      0.89      0.91       151

    accuracy                           0.91       302
   macro avg       0.91      0.91      0.91       302
weighted avg       0.91      0.91      0.91       302

In [155]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the training data
##################################
cm_raw = confusion_matrix(y_train_smote, individual_balanced_class_best_model_upsampled.predict(X_train_smote))
cm_normalized = confusion_matrix(y_train_smote, individual_balanced_class_best_model_upsampled.predict(X_train_smote), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Individual Model on Training Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Individual Model on Training Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
In [156]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
##################################
print(f"F1 Score on Validation Data: {individual_balanced_class_best_model_upsampled_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation, individual_balanced_class_best_model_upsampled.predict(X_validation)))
F1 Score on Validation Data: 0.9278

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.50      0.86      0.63         7
           1       0.98      0.88      0.93        51

    accuracy                           0.88        58
   macro avg       0.74      0.87      0.78        58
weighted avg       0.92      0.88      0.89        58

In [157]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation, individual_balanced_class_best_model_upsampled.predict(X_validation))
cm_normalized = confusion_matrix(y_validation, individual_balanced_class_best_model_upsampled.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Individual Model on Validation Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Individual Model on Validation Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
In [158]:
##################################
# Obtaining the logit values (log-odds)
# from the decision function for training data
##################################
individual_balanced_class_best_model_upsampled_logit_values = individual_balanced_class_best_model_upsampled.decision_function(X_train_smote)
In [159]:
##################################
# Obtaining the estimated probabilities 
# for the positive class (LUNG_CANCER=YES) for training data
##################################
individual_balanced_class_best_model_upsampled_probabilities = individual_balanced_class_best_model_upsampled.predict_proba(X_train_smote)[:, 1]
In [160]:
##################################
# Sorting the values to generate
# a smoother curve
##################################
individual_balanced_class_best_model_upsampled_sorted_indices = np.argsort(individual_balanced_class_best_model_upsampled_logit_values)
individual_balanced_class_best_model_upsampled_logit_values_sorted = individual_balanced_class_best_model_upsampled_logit_values[individual_balanced_class_best_model_upsampled_sorted_indices]
individual_balanced_class_best_model_upsampled_probabilities_sorted = individual_balanced_class_best_model_upsampled_probabilities[individual_balanced_class_best_model_upsampled_sorted_indices]
In [161]:
##################################
# Plotting the estimated logistic curve
# using the logit values
# and estimated probabilities
# obtained from the training data
##################################
plt.figure(figsize=(17, 8))
plt.plot(individual_balanced_class_best_model_upsampled_logit_values_sorted, 
         individual_balanced_class_best_model_upsampled_probabilities_sorted, label='Classification Model Logistic Curve', color='black')
plt.ylim(-0.05, 1.05)
plt.xlim(-8.00, 8.00)
target_0_indices = y_train_smote == 0
target_1_indices = y_train_smote == 1
plt.scatter(individual_balanced_class_best_model_upsampled_logit_values[target_0_indices], 
            individual_balanced_class_best_model_upsampled_probabilities[target_0_indices], 
            color='blue', alpha=0.40, s=100, marker= 'o', edgecolor='k', label='LUNG_CANCER=NO')
plt.scatter(individual_balanced_class_best_model_upsampled_logit_values[target_1_indices], 
            individual_balanced_class_best_model_upsampled_probabilities[target_1_indices], 
            color='red', alpha=0.40, s=100, marker='o', edgecolor='k', label='LUNG_CANCER=YES')
plt.axhline(0.5, color='green', linestyle='--', label='Classification Model Threshold')
plt.title('Logistic Curve (Upsampled Training Data): Individual Model')
plt.xlabel('Logit (Log-Odds)')
plt.ylabel('Estimated Lung Cancer Probability')
plt.grid(True)
plt.legend(loc='upper left')
plt.show()
In [162]:
##################################
# Saving the best individual model
# developed from the upsampled training data
################################## 
joblib.dump(individual_balanced_class_best_model_upsampled, 
            os.path.join("..", MODELS_PATH, "individual_balanced_class_best_model_upsampled.pkl"))
Out[162]:
['..\\models\\individual_balanced_class_best_model_upsampled.pkl']

1.6.5.2 Stacked Classifier ¶

Synthetic Minority Oversampling Technique (SMOTE) increases the representation of the minority class by generating new minority instances between existing ones. The new instances are not copies of existing minority cases. Instead, for each minority class instance, the algorithm creates synthetic examples by interpolating between the feature vector of that instance and the feature vectors of its k nearest minority-class neighbors, so that each synthetic sample lies on the line segment connecting the original instance to one of its neighbors.

Logistic Regression models the probability of an event (one of two outcome levels) by expressing the log-odds of the event as a linear combination of a set of predictors weighted by their respective parameter estimates. The parameters are estimated via maximum likelihood: an iterative optimizer tries successive parameter values and keeps those that increase the log-likelihood, which is obtained by calculating the conditional probability of each observed outcome under the current parameters, taking its logarithm, and summing over all observations. Given the optimal parameters, the predicted probability for an observation is obtained by applying the logistic function to its log-odds.

Decision Trees create a model that predicts the class label of a sample based on input features. A decision tree consists of internal nodes that represent decisions, edges that connect nodes and represent the possible outcomes of a decision, and leaf (or terminal) nodes that represent the predicted class label. The decision-making process involves feature selection (at each internal node, the algorithm decides which feature to split on based on a criterion such as Gini impurity or entropy), splitting criteria (which aim to find the feature and corresponding threshold that best separate the data into different classes, increasing homogeneity within each resulting subset), recursive splitting (the dataset is partitioned at each internal node based on the chosen feature, and the process repeats for each subset to build the tree structure), and stopping criteria (the recursion stops when a condition is met, such as a maximum tree depth, a minimum number of samples required to split a node, or a minimum number of samples in a leaf node).
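
The two impurity criteria named above can be computed directly from class counts. The sketch below is a simplified illustration, not scikit-learn's implementation. The parent counts match the balanced SMOTE-upsampled training data (151 cases per class), while the child counts describe a hypothetical split.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def information_gain(parent, children):
    """Entropy reduction achieved by splitting parent into children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [151, 151]                  # balanced classes after upsampling
left, right = [120, 20], [31, 131]   # hypothetical split of the parent node

gain = information_gain(parent, [left, right])
print(entropy(parent), gini(parent), gain)
```

A balanced two-class node has the maximum entropy of 1 bit and a Gini impurity of 0.5. A split is preferred when it lowers the weighted impurity of the children, which shows up here as a positive gain.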

Random Forest is an ensemble learning method made up of a large set of decision trees called estimators, each producing its own prediction. The random forest model aggregates the predictions of the estimators to produce a more accurate prediction. The algorithm involves bootstrap aggregating (where each estimator is trained on a sample drawn with replacement from the training data), random subspacing (where only a random subset of features, controlled by max_features, is considered as split candidates), estimator training (where a decision tree is grown for each bootstrap sample, typically without pruning) and inference by aggregating the predictions of all estimators.
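
Two of these ingredients, bootstrap sampling and aggregation, can be illustrated without fitting any trees. The sketch below assumes 302 training rows (the size of the SMOTE-upsampled training data) and five hypothetical per-tree probabilities. It averages the probabilities, which is how scikit-learn's random forest combines its trees, rather than counting hard votes.

```python
import random

rng = random.Random(88888888)
n_samples = 302  # size of the SMOTE-upsampled training data

# Bootstrap sample: draw row indices with replacement, same size as the data
bootstrap_indices = [rng.randrange(n_samples) for _ in range(n_samples)]
unique_fraction = len(set(bootstrap_indices)) / n_samples

# Aggregation: average the positive-class probabilities of the estimators
tree_probabilities = [0.9, 0.4, 0.7, 0.8, 0.3]  # hypothetical per-tree outputs
forest_probability = sum(tree_probabilities) / len(tree_probabilities)
forest_prediction = int(forest_probability >= 0.5)

print(unique_fraction, forest_probability, forest_prediction)
```

Each bootstrap sample typically contains a little under two thirds of the distinct training rows, so every tree sees a somewhat different dataset.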

Support Vector Machine plots each observation in an N-dimensional space, where N is the number of features in the data set, and finds the hyperplane that separates the classes with the largest possible margin (defined as the distance between the hyperplane and the closest data points from each class). For data that are not linearly separable, the algorithm can apply a kernel function that measures the similarity between points as if they were mapped into a higher-dimensional feature space, where a separating hyperplane is easier to find. The models in this section use a linear kernel, which applies no such mapping.
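
For the linear kernel, the hyperplane and its margin reduce to simple arithmetic. The sketch below assumes a hypothetical weight vector `w` and intercept `b` in two dimensions. The values are made up for illustration and are not taken from the fitted model.

```python
import math

w = [2.0, -1.0]  # hypothetical hyperplane weights
b = -0.5         # hypothetical intercept

def decision_value(x):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Width of the margin between the two boundaries w.x + b = +1 and w.x + b = -1
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))

on_positive_boundary = decision_value([1.0, 0.5])
on_negative_boundary = decision_value([0.0, 0.5])
print(margin_width, on_positive_boundary, on_negative_boundary)
```

Points with a decision value of exactly +1 or -1 sit on the margin boundaries. A smaller weight norm means a wider margin, and the `C` hyperparameter trades that width against margin violations.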

Hyperparameter Tuning is an iterative process that involves experimenting with different hyperparameter combinations, evaluating the model's performance for each, and refining the hyperparameter values to achieve the best possible performance on new, unseen data. A model's performance depends not only on the parameters (weights) learned during training but also on hyperparameters, which are configuration settings fixed before training rather than estimated by the fitting procedure.

Model Stacking, also known as stacked generalization, is an ensemble approach that trains a variety of base learners and uses them to create intermediate predictions, one for each learned model. A meta-model is then trained to predict the same target from these intermediate predictions. Unlike bagging, the models in stacking are typically different (e.g. not all decision trees) and are fit on the same dataset (rather than on samples of the training dataset). Unlike boosting, a single model is used to learn how best to combine the predictions from the contributing models (rather than a sequence of models that each correct the predictions of prior models). Stacking is appropriate when the predictions made by the base learners, or the errors in those predictions, have minimal correlation. Any improvement in performance depends on the choice of base learners and on whether they are sufficiently skillful in their predictions.
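
The flow from base learners to meta-learner can be traced on a small synthetic dataset. The sketch below assumes scikit-learn is installed and uses `make_classification` data, not the lung cancer data. The estimator names mirror the pipeline in this section, but the settings are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the notebook's training data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("rf", RandomForestClassifier(max_depth=5, n_estimators=50, random_state=0)),
        ("svm", SVC(kernel="linear", probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=5000),
    cv=5,
)
stack.fit(X_demo, y_demo)

# Intermediate predictions: one positive-class probability per base learner
meta_features = stack.transform(X_demo)

# The meta-learner's logit is computed from those intermediate predictions
stack_logits = stack.decision_function(X_demo)
meta_logits = stack.final_estimator_.decision_function(meta_features)

print(meta_features.shape, stack.final_estimator_.coef_)
```

The meta-learner holds one coefficient per base learner, so its coefficients indicate how much each base learner's probability contributes to the final log-odds.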

  1. The optimal decision tree model (base learner) determined from the 5-fold cross-validation of train data (SMOTE-upsampled) contained the following hyperparameters:
    • max_depth = 3
    • class_weight = none
    • criterion = entropy
    • min_samples_leaf = 3
    • random_state = 88888888
  2. The optimal random forest model (base learner) determined from the 5-fold cross-validation of train data (SMOTE-upsampled) contained the following hyperparameters:
    • max_depth = 5
    • class_weight = none
    • criterion = entropy
    • max_features = sqrt
    • min_samples_leaf = 3
    • random_state = 88888888
  3. The optimal support vector machine model (base learner) determined from the 5-fold cross-validation of train data (SMOTE-upsampled) contained the following hyperparameters:
    • C = 1.00
    • class_weight = none
    • kernel = linear
    • probability = true
    • random_state = 88888888
  4. The optimal logistic regression model (meta-learner) determined from the 5-fold cross-validation of train data (SMOTE-upsampled) contained the following hyperparameters:
    • penalty = none
    • class_weight = none
    • solver = saga
    • max_iter = 5000
    • random_state = 88888888
  5. The F1 scores estimated for the different data subsets were as follows:
    • train data (SMOTE-upsampled) = 0.9568
    • train data (cross-validated) = 0.9489
    • validation data = 0.9615
  6. Minimal overfitting was noted, based on the small difference (0.0079) between the apparent and cross-validated F1 scores.
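
The overfitting check in this section amounts to subtracting the cross-validated F1 score from the apparent (training) F1 score. The sketch below uses the scores reported for the two SMOTE-upsampled models; the dictionary layout and the name `optimism` are illustrative choices.

```python
# F1 scores reported for the models fitted on SMOTE-upsampled training data
f1_scores = {
    "individual": {"train": 0.9122, "cv": 0.9109, "validation": 0.9278},
    "stacked": {"train": 0.9568, "cv": 0.9489, "validation": 0.9615},
}

# Optimism: how much the apparent score exceeds the cross-validated score
optimism = {
    name: round(scores["train"] - scores["cv"], 4)
    for name, scores in f1_scores.items()
}
print(optimism)
```

Both gaps are below 0.01. Since the folds are drawn from data that was already upsampled, synthetic cases may share parent instances across folds, so the cross-validated score can itself lean optimistic. The validation data was not resampled and so provides the more independent check.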
In [163]:
##################################
# Fitting the model on the 
# upsampled training data
##################################
stacked_balanced_class_grid_search.fit(X_train_smote, y_train_smote)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Out[163]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('stacked_model',
                                        StackingClassifier(estimators=[('dt',
                                                                        DecisionTreeClassifier(criterion='entropy',
                                                                                               min_samples_leaf=3,
                                                                                               random_state=88888888)),
                                                                       ('rf',
                                                                        RandomForestClassifier(criterion='entropy',
                                                                                               min_samples_leaf=3,
                                                                                               random_state=88888888)),
                                                                       ('svm',
                                                                        SVC(kernel='linear',
                                                                            probability=True,
                                                                            random_state=88888888))],
                                                           final_estimator=LogisticRegression(max_iter=5000,
                                                                                              random_state=88888888,
                                                                                              solver='saga')))]),
             n_jobs=-1,
             param_grid={'stacked_model__dt__max_depth': [3, 5],
                         'stacked_model__final_estimator__class_weight': [None],
                         'stacked_model__final_estimator__penalty': ['l1', 'l2',
                                                                     None],
                         'stacked_model__rf__max_depth': [3, 5],
                         'stacked_model__svm__C': [0.5, 1.0]},
             scoring='f1', verbose=1)
Pipeline(steps=[('stacked_model',
                 StackingClassifier(estimators=[('dt',
                                                 DecisionTreeClassifier(criterion='entropy',
                                                                        max_depth=3,
                                                                        min_samples_leaf=3,
                                                                        random_state=88888888)),
                                                ('rf',
                                                 RandomForestClassifier(criterion='entropy',
                                                                        max_depth=5,
                                                                        min_samples_leaf=3,
                                                                        random_state=88888888)),
                                                ('svm',
                                                 SVC(kernel='linear',
                                                     probability=True,
                                                     random_state=88888888))],
                                    final_estimator=LogisticRegression(max_iter=5000,
                                                                       penalty=None,
                                                                       random_state=88888888,
                                                                       solver='saga')))])
In [164]:
##################################
# Identifying the best model
##################################
stacked_balanced_class_best_model_upsampled = stacked_balanced_class_grid_search.best_estimator_
In [165]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_balanced_class_best_model_upsampled_f1_cv = stacked_balanced_class_grid_search.best_score_
stacked_balanced_class_best_model_upsampled_f1_train_smote = f1_score(y_train_smote, stacked_balanced_class_best_model_upsampled.predict(X_train_smote))
stacked_balanced_class_best_model_upsampled_f1_validation = f1_score(y_validation, stacked_balanced_class_best_model_upsampled.predict(X_validation))
In [166]:
##################################
# Identifying the optimal model
##################################
print('Best Stacked Model using the SMOTE-Upsampled Train Data: ')
print(f"Best Stacked Model Parameters: {stacked_balanced_class_grid_search.best_params_}")
Best Stacked Model using the SMOTE-Upsampled Train Data: 
Best Stacked Model Parameters: {'stacked_model__dt__max_depth': 3, 'stacked_model__final_estimator__class_weight': None, 'stacked_model__final_estimator__penalty': None, 'stacked_model__rf__max_depth': 5, 'stacked_model__svm__C': 1.0}
In [167]:
##################################
# Summarizing the F1 score results
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_balanced_class_best_model_upsampled_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_balanced_class_best_model_upsampled_f1_train_smote:.4f}")
print("\nClassification Report on Training Data:\n", classification_report(y_train_smote, stacked_balanced_class_best_model_upsampled.predict(X_train_smote)))
F1 Score on Cross-Validated Data: 0.9489
F1 Score on Training Data: 0.9568

Classification Report on Training Data:
               precision    recall  f1-score   support

           0       0.95      0.96      0.96       151
           1       0.96      0.95      0.96       151

    accuracy                           0.96       302
   macro avg       0.96      0.96      0.96       302
weighted avg       0.96      0.96      0.96       302

In [168]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the training data
##################################
cm_raw = confusion_matrix(y_train_smote, stacked_balanced_class_best_model_upsampled.predict(X_train_smote))
cm_normalized = confusion_matrix(y_train_smote, stacked_balanced_class_best_model_upsampled.predict(X_train_smote), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Stacked Model on Training Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Stacked Model on Training Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
Figure: Confusion matrices (raw count and normalized) for the best stacked model on the training data.
In [169]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
##################################
print(f"F1 Score on Validation Data: {stacked_balanced_class_best_model_upsampled_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation, stacked_balanced_class_best_model_upsampled.predict(X_validation)))
F1 Score on Validation Data: 0.9615

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.80      0.57      0.67         7
           1       0.94      0.98      0.96        51

    accuracy                           0.93        58
   macro avg       0.87      0.78      0.81        58
weighted avg       0.93      0.93      0.93        58

In [170]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation, stacked_balanced_class_best_model_upsampled.predict(X_validation))
cm_normalized = confusion_matrix(y_validation, stacked_balanced_class_best_model_upsampled.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Stacked Model on Validation Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Stacked Model on Validation Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
Figure: Confusion matrices (raw count and normalized) for the best stacked model on the validation data.
In [171]:
##################################
# Obtaining the logit values (log-odds)
# from the decision function for training data
##################################
stacked_balanced_class_best_model_upsampled_logit_values = stacked_balanced_class_best_model_upsampled.decision_function(X_train_smote)
In [172]:
##################################
# Obtaining the estimated probabilities 
# for the positive class (LUNG_CANCER=YES) for training data
##################################
stacked_balanced_class_best_model_upsampled_probabilities = stacked_balanced_class_best_model_upsampled.predict_proba(X_train_smote)[:, 1]
In [173]:
##################################
# Sorting the values to generate
# a smoother curve
##################################
stacked_balanced_class_best_model_upsampled_sorted_indices = np.argsort(stacked_balanced_class_best_model_upsampled_logit_values)
stacked_balanced_class_best_model_upsampled_logit_values_sorted = stacked_balanced_class_best_model_upsampled_logit_values[stacked_balanced_class_best_model_upsampled_sorted_indices]
stacked_balanced_class_best_model_upsampled_probabilities_sorted = stacked_balanced_class_best_model_upsampled_probabilities[stacked_balanced_class_best_model_upsampled_sorted_indices]
In [174]:
##################################
# Plotting the estimated logistic curve
# using the logit values
# and estimated probabilities
# obtained from the training data
##################################
plt.figure(figsize=(17, 8))
plt.plot(stacked_balanced_class_best_model_upsampled_logit_values_sorted, 
         stacked_balanced_class_best_model_upsampled_probabilities_sorted, label='Classification Model Logistic Curve', color='black')
plt.ylim(-0.05, 1.05)
plt.xlim(-8.00, 8.00)
target_0_indices = y_train_smote == 0
target_1_indices = y_train_smote == 1
plt.scatter(stacked_balanced_class_best_model_upsampled_logit_values[target_0_indices], 
            stacked_balanced_class_best_model_upsampled_probabilities[target_0_indices], 
            color='blue', alpha=0.40, s=100, marker= 'o', edgecolor='k', label='LUNG_CANCER=NO')
plt.scatter(stacked_balanced_class_best_model_upsampled_logit_values[target_1_indices], 
            stacked_balanced_class_best_model_upsampled_probabilities[target_1_indices], 
            color='red', alpha=0.40, s=100, marker='o', edgecolor='k', label='LUNG_CANCER=YES')
plt.axhline(0.5, color='green', linestyle='--', label='Classification Model Threshold')
plt.title('Logistic Curve (Upsampled Training Data): Stacked Model')
plt.xlabel('Logit (Log-Odds)')
plt.ylabel('Estimated Lung Cancer Probability')
plt.grid(True)
plt.legend(loc='upper left')
plt.show()
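Figure: Logistic curve (upsampled training data) for the stacked model, with estimated lung cancer probability plotted against the logit (log-odds).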
In [175]:
##################################
# Saving the best stacked model
# developed from the upsampled training data
################################## 
joblib.dump(stacked_balanced_class_best_model_upsampled, 
            os.path.join("..", MODELS_PATH, "stacked_balanced_class_best_model_upsampled.pkl"))
Out[175]:
['..\\models\\stacked_balanced_class_best_model_upsampled.pkl']

1.6.6 Model Fitting using Downsampled Training Data | Hyperparameter Tuning | Validation ¶

1.6.6.1 Individual Classifier ¶

Condensed Nearest Neighbors is a prototype selection algorithm that selects a subset of instances from the original dataset, discarding redundant and less informative instances. The algorithm starts with an empty subset and adds instances iteratively. At each iteration, a candidate instance is classified by a k-nearest neighbors classifier fitted on the current subset; if it is misclassified, it is added to the subset, otherwise it is discarded. The process is repeated until a full pass adds no new instances. The resulting subset is a condensed representation of the dataset that retains the essential information needed for classification.
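
The condensation loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the imbalanced-learn implementation used in the notebook: it assumes a 1-nearest-neighbor rule with Euclidean distance, condenses all classes rather than only the majority class, and runs on a hypothetical toy dataset. The function names are made up for this example.

```python
import math


def nearest_label(point, store):
    """Return the label of the stored instance closest to `point` (1-NN)."""
    closest = min(store, key=lambda item: math.dist(point, item[0]))
    return closest[1]


def condensed_nearest_neighbors(samples):
    """Condense a list of (features, label) pairs into a smaller subset."""
    # Seeding with the first instance is equivalent to starting from an
    # empty subset, since the first instance cannot be classified by it.
    store = [samples[0]]
    changed = True
    while changed:  # repeat until a full pass adds nothing
        changed = False
        for features, label in samples:
            if (features, label) in store:
                continue
            # Keep only the instances the current subset gets wrong
            if nearest_label(features, store) != label:
                store.append((features, label))
                changed = True
    return store


# Hypothetical toy data: two well-separated clusters in two dimensions
toy = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0),
       ((5.0, 5.0), 1), ((5.1, 4.9), 1), ((4.9, 5.2), 1)]
subset = condensed_nearest_neighbors(toy)
print(len(toy), "->", len(subset))
```

On this toy data one instance per cluster is enough to classify every other instance correctly, which is why the retained subset is much smaller than the input.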

Logistic Regression models the probability of an event (among two outcome levels) by expressing the log-odds of the event as a linear combination of a set of predictors weighted by their respective parameter estimates. The parameters are estimated via maximum likelihood estimation, which iteratively tests different values to find the best fit of the log-odds. For each candidate set of parameters, the conditional probabilities of the observed outcomes are calculated, logged, and summed over all observations to yield the log-likelihood, and logistic regression seeks the parameter estimates that maximize this function. Given the optimal parameters, the predicted probability for an observation is obtained by applying the logistic (sigmoid) function to its log-odds.
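
As a rough illustration of the quantity that maximum likelihood estimation works with, the sketch below computes the log-likelihood of a one-predictor logistic model for two candidate parameter sets. The data, the parameter values, and the helper names are hypothetical and chosen only for illustration; the notebook itself relies on scikit-learn's `LogisticRegression` for the actual fitting.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(intercept, coefficients, rows, labels):
    """Sum the logged probabilities assigned to the observed outcomes."""
    total = 0.0
    for features, y in zip(rows, labels):
        logit = intercept + sum(b * x for b, x in zip(coefficients, features))
        p = sigmoid(logit)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# Hypothetical data: larger predictor values go with the positive class
rows = [(0.0,), (1.0,), (2.0,), (3.0,)]
labels = [0, 0, 1, 1]

poor = log_likelihood(0.0, (0.0,), rows, labels)     # every probability is 0.5
better = log_likelihood(-3.0, (2.0,), rows, labels)  # follows the trend
print(f"poor fit: {poor:.4f}, better fit: {better:.4f}")
```

A parameter set that assigns higher probabilities to the observed outcomes produces a larger (less negative) log-likelihood, which is the direction the optimizer moves in.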

Class Weights are used to assign different levels of importance to different classes when the distribution of instances across different classes in a classification problem is not equal. By assigning higher weights to the minority class, the model is encouraged to give more attention to correctly predicting instances from the minority class. Class weights are incorporated into the loss function during training. The loss for each instance is multiplied by its corresponding class weight. This means that misclassifying an instance from the minority class will have a greater impact on the overall loss than misclassifying an instance from the majority class. The use of class weights helps balance the influence of each class during training, mitigating the impact of class imbalance. It provides a way to focus the learning process on the classes that are underrepresented in the training data.
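
To make the weighting concrete, the sketch below applies the formula that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * class_count)`, to the class counts shown in the training classification report of this section (22 negative and 39 positive cases after CNN downsampling). The helper name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Class counts taken from the CNN-downsampled training report
labels = [0] * 22 + [1] * 39
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(f"class {cls}: weight = {weights[cls]:.4f}")
```

With these weights both classes contribute the same total weight to the loss (22 × 1.3864 ≈ 39 × 0.7821 ≈ 30.5), so the smaller class is no longer dominated by the larger one.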

Hyperparameter Tuning is an iterative process that involves experimenting with different hyperparameter combinations, evaluating the model's performance, and refining the hyperparameter values to achieve the best possible performance on new, unseen data, with the aim of building effective and well-generalizing machine learning models. A model's performance depends not only on the parameters (weights) learned during training but also on hyperparameters, which are external configuration settings that are not learned from the data.

  1. The optimal logistic regression model (individual classifier) from the 5-fold cross-validation of train data (CNN-downsampled) contained the following hyperparameters:
    • penalty = L2
    • class_weight = balanced
    • solver = saga
    • max_iter = 5000
    • random_state = 88888888
  2. The F1 scores estimated for the different data subsets were as follows:
    • train data (CNN-downsampled) = 0.8533
    • train data (cross-validated) = 0.7537
    • validation data = 0.9709
  3. High overfitting noted based on the large difference between the apparent (0.8533) and cross-validated (0.7537) F1 scores.
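
The figures in the list above can be related through the definition of the F1 score. The sketch below is illustrative only: the true-positive, false-positive and false-negative counts are an assumption inferred from the rounded validation classification report in this section (51 positive and 7 negative cases), not values read from the notebook, and `f1_from_counts` is a hypothetical helper.

```python
def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Assumed counts, consistent with the rounded validation report:
# 50 of 51 positives recovered, 2 of 7 negatives flagged as positive.
tp, fp, fn = 50, 2, 1
validation_f1 = f1_from_counts(tp, fp, fn)

# Optimism: gap between the apparent and cross-validated F1 scores
apparent_f1 = 0.8533
cross_validated_f1 = 0.7537
optimism = apparent_f1 - cross_validated_f1

print(f"validation F1 = {validation_f1:.4f}")
print(f"optimism      = {optimism:.4f}")
```

Under these assumed counts the computed validation F1 agrees with the reported 0.9709, and the optimism of roughly 0.10 is the gap that the overfitting remark refers to.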
In [176]:
##################################
# Fitting the model on the 
# downsampled training data
##################################
individual_unbalanced_class_grid_search.fit(X_train_cnn, y_train_cnn)
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Out[176]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('individual_model',
                                        LogisticRegression(max_iter=5000,
                                                           random_state=88888888,
                                                           solver='saga'))]),
             n_jobs=-1,
             param_grid={'individual_model__class_weight': ['balanced'],
                         'individual_model__penalty': ['l1', 'l2', None]},
             scoring='f1', verbose=1)
In [177]:
##################################
# Identifying the best model
##################################
individual_unbalanced_class_best_model_downsampled = individual_unbalanced_class_grid_search.best_estimator_
In [178]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
individual_unbalanced_class_best_model_downsampled_f1_cv = individual_unbalanced_class_grid_search.best_score_
individual_unbalanced_class_best_model_downsampled_f1_train_cnn = f1_score(y_train_cnn, individual_unbalanced_class_best_model_downsampled.predict(X_train_cnn))
individual_unbalanced_class_best_model_downsampled_f1_validation = f1_score(y_validation, individual_unbalanced_class_best_model_downsampled.predict(X_validation))
In [179]:
##################################
# Identifying the optimal model
##################################
print('Best Individual Model using the CNN-Downsampled Train Data: ')
print(f"Best Individual Model Parameters: {individual_unbalanced_class_grid_search.best_params_}")
Best Individual Model using the CNN-Downsampled Train Data: 
Best Individual Model Parameters: {'individual_model__class_weight': 'balanced', 'individual_model__penalty': 'l2'}
In [180]:
##################################
# Summarizing the F1 score results
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {individual_unbalanced_class_best_model_downsampled_f1_cv:.4f}")
print(f"F1 Score on Training Data: {individual_unbalanced_class_best_model_downsampled_f1_train_cnn:.4f}")
print("\nClassification Report on Training Data:\n", classification_report(y_train_cnn, individual_unbalanced_class_best_model_downsampled.predict(X_train_cnn)))
F1 Score on Cross-Validated Data: 0.7537
F1 Score on Training Data: 0.8533

Classification Report on Training Data:
               precision    recall  f1-score   support

           0       0.72      0.82      0.77        22
           1       0.89      0.82      0.85        39

    accuracy                           0.82        61
   macro avg       0.80      0.82      0.81        61
weighted avg       0.83      0.82      0.82        61

In [181]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the training data
##################################
cm_raw = confusion_matrix(y_train_cnn, individual_unbalanced_class_best_model_downsampled.predict(X_train_cnn))
cm_normalized = confusion_matrix(y_train_cnn, individual_unbalanced_class_best_model_downsampled.predict(X_train_cnn), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Individual Model on Training Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Individual Model on Training Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
Figure: Confusion matrices (raw count and normalized) for the best individual model on the training data.
In [182]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
##################################
print(f"F1 Score on Validation Data: {individual_unbalanced_class_best_model_downsampled_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation, individual_unbalanced_class_best_model_downsampled.predict(X_validation)))
F1 Score on Validation Data: 0.9709

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.83      0.71      0.77         7
           1       0.96      0.98      0.97        51

    accuracy                           0.95        58
   macro avg       0.90      0.85      0.87        58
weighted avg       0.95      0.95      0.95        58

In [183]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation, individual_unbalanced_class_best_model_downsampled.predict(X_validation))
cm_normalized = confusion_matrix(y_validation, individual_unbalanced_class_best_model_downsampled.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Individual Model on Validation Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Individual Model on Validation Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
Figure: Confusion matrices (raw count and normalized) for the best individual model on the validation data.
In [184]:
##################################
# Obtaining the logit values (log-odds)
# from the decision function for training data
##################################
individual_unbalanced_class_best_model_downsampled_logit_values = individual_unbalanced_class_best_model_downsampled.decision_function(X_train_cnn)
In [185]:
##################################
# Obtaining the estimated probabilities 
# for the positive class (LUNG_CANCER=YES) for training data
##################################
individual_unbalanced_class_best_model_downsampled_probabilities = individual_unbalanced_class_best_model_downsampled.predict_proba(X_train_cnn)[:, 1]
In [186]:
##################################
# Sorting the values to generate
# a smoother curve
##################################
individual_unbalanced_class_best_model_downsampled_sorted_indices = np.argsort(individual_unbalanced_class_best_model_downsampled_logit_values)
individual_unbalanced_class_best_model_downsampled_logit_values_sorted = individual_unbalanced_class_best_model_downsampled_logit_values[individual_unbalanced_class_best_model_downsampled_sorted_indices]
individual_unbalanced_class_best_model_downsampled_probabilities_sorted = individual_unbalanced_class_best_model_downsampled_probabilities[individual_unbalanced_class_best_model_downsampled_sorted_indices]
In [187]:
##################################
# Plotting the estimated logistic curve
# using the logit values
# and estimated probabilities
# obtained from the training data
##################################
plt.figure(figsize=(17, 8))
plt.plot(individual_unbalanced_class_best_model_downsampled_logit_values_sorted, 
         individual_unbalanced_class_best_model_downsampled_probabilities_sorted, label='Classification Model Logistic Curve', color='black')
plt.ylim(-0.05, 1.05)
plt.xlim(-8.00, 8.00)
target_0_indices = y_train_cnn == 0
target_1_indices = y_train_cnn == 1
plt.scatter(individual_unbalanced_class_best_model_downsampled_logit_values[target_0_indices], 
            individual_unbalanced_class_best_model_downsampled_probabilities[target_0_indices], 
            color='blue', alpha=0.40, s=100, marker= 'o', edgecolor='k', label='LUNG_CANCER=NO')
plt.scatter(individual_unbalanced_class_best_model_downsampled_logit_values[target_1_indices], 
            individual_unbalanced_class_best_model_downsampled_probabilities[target_1_indices], 
            color='red', alpha=0.40, s=100, marker='o', edgecolor='k', label='LUNG_CANCER=YES')
plt.axhline(0.5, color='green', linestyle='--', label='Classification Model Threshold')
plt.title('Logistic Curve (Downsampled Training Data): Individual Model')
plt.xlabel('Logit (Log-Odds)')
plt.ylabel('Estimated Lung Cancer Probability')
plt.grid(True)
plt.legend(loc='upper left')
plt.show()
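Figure: Logistic curve (downsampled training data) for the individual model, with estimated lung cancer probability plotted against the logit (log-odds).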
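
The curve plotted above follows from the logistic (sigmoid) relationship between the logit returned by `decision_function` and the probability returned by `predict_proba`. The standalone sketch below reproduces that mapping for a few hypothetical logit values spanning the plotted range of -8 to 8. The helper names are illustrative and not part of the notebook.

```python
import math


def logit_to_probability(logit):
    """Logistic (sigmoid) function: log-odds to probability."""
    return 1.0 / (1.0 + math.exp(-logit))


def probability_to_logit(probability):
    """Inverse mapping: probability to log-odds."""
    return math.log(probability / (1.0 - probability))


# Hypothetical logit values across the plotted x-axis range
for logit in (-8.0, -2.0, 0.0, 2.0, 8.0):
    print(f"logit {logit:+.1f} -> probability {logit_to_probability(logit):.4f}")
```

A logit of zero corresponds to a probability of exactly 0.5, so the horizontal threshold line at 0.5 in the plot is equivalent to classifying by the sign of the logit.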
In [188]:
##################################
# Saving the best individual model
# developed from the downsampled training data
################################## 
joblib.dump(individual_unbalanced_class_best_model_downsampled, 
            os.path.join("..", MODELS_PATH, "individual_unbalanced_class_best_model_downsampled.pkl"))
Out[188]:
['..\\models\\individual_unbalanced_class_best_model_downsampled.pkl']

1.6.6.2 Stacked Classifier ¶

Condensed Nearest Neighbors is a prototype selection algorithm that selects a subset of instances from the original dataset, discarding redundant and less informative instances. The algorithm starts with an empty subset and adds instances iteratively. At each iteration, a candidate instance is classified by a k-nearest neighbors classifier fitted on the current subset; if it is misclassified, it is added to the subset, otherwise it is discarded. The process is repeated until a full pass adds no new instances. The resulting subset is a condensed representation of the dataset that retains the essential information needed for classification.

Logistic Regression models the probability of an event (among two outcome levels) by expressing the log-odds of the event as a linear combination of a set of predictors weighted by their respective parameter estimates. The parameters are estimated via maximum likelihood estimation, which iteratively tests different values to find the best fit of the log-odds. For each candidate set of parameters, the conditional probabilities of the observed outcomes are calculated, logged, and summed over all observations to yield the log-likelihood, and logistic regression seeks the parameter estimates that maximize this function. Given the optimal parameters, the predicted probability for an observation is obtained by applying the logistic (sigmoid) function to its log-odds.

Decision Trees create a model that predicts the class label of a sample based on input features. A decision tree consists of nodes that represent decisions or choices, edges that connect nodes and represent the possible outcomes of a decision, and leaf (or terminal) nodes that represent the final decision or predicted class label. The decision-making process involves feature selection (at each internal node, the algorithm decides which feature to split on based on a criterion such as Gini impurity or entropy), splitting criteria (which aim to find the feature and corresponding threshold that best separate the data into different classes, increasing homogeneity within each resulting subset), recursive splitting (the dataset is partitioned at each internal node based on the chosen feature, and the process repeats for each subset, creating a tree structure) and stopping criteria (the recursion stops when a specified condition is met, such as a maximum tree depth, a minimum number of samples required to split a node, or a minimum number of samples in a leaf node).
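
The entropy criterion mentioned above can be illustrated with a short sketch that scores two candidate splits of a node by information gain. The labels are hypothetical and the helper names are made up for this example; scikit-learn performs the equivalent search internally when `criterion='entropy'` is set.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into two children."""
    n = len(parent)
    weighted_child_entropy = ((len(left) / n) * entropy(left)
                              + (len(right) / n) * entropy(right))
    return entropy(parent) - weighted_child_entropy


# Hypothetical node containing four samples of each class
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the classes perfectly
pure_gain = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])
# A split that leaves both children as mixed as the parent
mixed_gain = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(f"gain of separating split: {pure_gain:.2f}")
print(f"gain of uninformative split: {mixed_gain:.2f}")
```

The split that yields homogeneous children receives the larger gain and would be preferred at that node.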

Random Forest is an ensemble learning method made up of a large set of small decision trees called estimators, each producing its own prediction. The random forest model aggregates the predictions of the estimators to produce a more accurate prediction. The algorithm involves bootstrap aggregating (where subsets of the training data are repeatedly sampled with replacement), random subspacing (where a subset of features is sampled and used to train each individual estimator), estimator training (where an unpruned decision tree is fitted for each estimator) and inference by aggregating the predictions of all estimators.
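
The three ingredients named above (bootstrap sampling, random feature subsets and aggregation) can be sketched with the standard library alone. This is a simplified illustration that follows the description in the paragraph, under stated assumptions: the helper names, the 16-feature setting and the example votes are hypothetical. It also differs from scikit-learn's `RandomForestClassifier`, which samples candidate features at each split and aggregates by averaging predicted class probabilities rather than by hard voting.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices."""
    k = max(1, int(n_features ** 0.5))
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Aggregate hard class predictions by taking the most common label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(88888888)  # seed reused from the notebook for repeatability
rows = list(range(10))         # hypothetical row identifiers

sample = bootstrap_sample(rows, rng)
features = random_feature_subset(16, rng)
vote = majority_vote([1, 0, 1])  # hypothetical predictions from three trees

print("bootstrap sample:", sample)
print("feature subset:", features)
print("aggregated prediction:", vote)
```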

Support Vector Machine plots each observation in an N-dimensional space, where N is the number of features in the data set, and finds the hyperplane that separates the different classes with the largest possible margin (defined as the distance between the hyperplane and the closest data points from each class). For data that are not linearly separable, the algorithm applies a kernel transformation that uses the similarities between points to map them into a high-dimensional feature space for improved discrimination.
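
The notion of margin can be made concrete by measuring the distance from each point to a candidate hyperplane. The sketch below compares two hypothetical separating lines on toy two-dimensional data; it does not train an SVM, and the helper names and data are assumptions made purely for illustration.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points):
    """Distance from the hyperplane to its closest point."""
    return min(abs(signed_distance(w, b, x)) for x in points)


# Hypothetical data: the first two points belong to one class,
# the last two to the other
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]

# Two candidate lines of the form x1 + x2 = c, both separating the classes
wide = margin((1.0, 1.0), -6.0, points)    # midway between the classes
narrow = margin((1.0, 1.0), -4.5, points)  # close to one class

print(f"margin of centred line: {wide:.4f}")
print(f"margin of off-centre line: {narrow:.4f}")
```

Both lines separate the toy classes, but the centred one leaves more room on either side, which is the property a maximum-margin classifier optimizes for.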

Class Weights are used to assign different levels of importance to different classes when the distribution of instances across different classes in a classification problem is not equal. By assigning higher weights to the minority class, the model is encouraged to give more attention to correctly predicting instances from the minority class. Class weights are incorporated into the loss function during training. The loss for each instance is multiplied by its corresponding class weight. This means that misclassifying an instance from the minority class will have a greater impact on the overall loss than misclassifying an instance from the majority class. The use of class weights helps balance the influence of each class during training, mitigating the impact of class imbalance. It provides a way to focus the learning process on the classes that are underrepresented in the training data.

Hyperparameter Tuning is an iterative process that involves experimenting with different hyperparameter combinations, evaluating the model's performance, and refining the hyperparameter values to achieve the best possible performance on new, unseen data, with the aim of building effective and well-generalizing machine learning models. A model's performance depends not only on the parameters (weights) learned during training but also on hyperparameters, which are external configuration settings that are not learned from the data.
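
An exhaustive grid search evaluates every combination of the listed hyperparameter values on every cross-validation fold. The sketch below enumerates the combinations for the stacked model's parameter grid, copied from the `GridSearchCV` output in this section, to show where the reported number of fits comes from. Only the enumeration is illustrated; no models are fitted.

```python
from itertools import product

# Parameter grid as printed in the GridSearchCV output for the stacked model
param_grid = {
    "stacked_model__dt__max_depth": [3, 5],
    "stacked_model__final_estimator__class_weight": ["balanced"],
    "stacked_model__final_estimator__penalty": ["l1", "l2", None],
    "stacked_model__rf__max_depth": [3, 5],
    "stacked_model__svm__C": [0.5, 1.0],
}

# One candidate per combination of values (Cartesian product)
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_folds = 5
total_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
```

The count grows multiplicatively with each added hyperparameter value, which is why the grids in this notebook are kept small.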

Model Stacking, also known as stacked generalization, is an ensemble approach that involves creating a variety of base learners and using them to create intermediate predictions, one for each learned model. A meta-model is then trained to predict the same target from these intermediate predictions. Unlike bagging, in stacking, the models are typically different (e.g. not all decision trees) and fit on the same dataset (e.g. instead of samples of the training dataset). Unlike boosting, in stacking, a single model is used to learn how to best combine the predictions from the contributing models (e.g. instead of a sequence of models that correct the predictions of prior models). Stacking is appropriate when the predictions made by the base learners or the errors in predictions made by the models have minimal correlation. Achieving an improvement in performance is dependent upon the choice of base learners and whether they are sufficiently skillful in their predictions.

  1. The optimal decision tree model (base learner) determined from the 5-fold cross-validation of train data (CNN-downsampled) contained the following hyperparameters:
    • max_depth = 3
    • class_weight = balanced
    • criterion = entropy
    • min_samples_leaf = 3
    • random_state = 88888888
  2. The optimal random forest model (base learner) determined from the 5-fold cross-validation of train data (CNN-downsampled) contained the following hyperparameters:
    • max_depth = 3
    • class_weight = balanced
    • criterion = entropy
    • max_features = sqrt
    • min_samples_leaf = 3
    • random_state = 88888888
  3. The optimal support vector machine model (base learner) determined from the 5-fold cross-validation of train data (CNN-downsampled) contained the following hyperparameters:
    • C = 1.00
    • class_weight = balanced
    • kernel = linear
    • probability = true
    • random_state = 88888888
  4. The optimal logistic regression model (meta-learner) determined from the 5-fold cross-validation of train data (CNN-downsampled) contained the following hyperparameters:
    • penalty = none
    • class_weight = balanced
    • solver = saga
    • max_iter = 5000
    • random_state = 88888888
  5. The F1 scores estimated for the different data subsets were as follows:
    • train data (CNN-downsampled) = 0.8219
    • train data (cross-validated) = 0.7531
    • validation data = 0.9524
  6. High overfitting noted based on the large difference between the apparent (0.8219) and cross-validated (0.7531) F1 scores.
In [189]:
##################################
# Fitting the model on the 
# downsampled training data
##################################
stacked_unbalanced_class_grid_search.fit(X_train_cnn, y_train_cnn)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Out[189]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('stacked_model',
                                        StackingClassifier(estimators=[('dt',
                                                                        DecisionTreeClassifier(class_weight='balanced',
                                                                                               criterion='entropy',
                                                                                               min_samples_leaf=3,
                                                                                               random_state=88888888)),
                                                                       ('rf',
                                                                        RandomForestClassifier(class_weight='balanced',
                                                                                               criterion='entropy',
                                                                                               min_samples_leaf=3,
                                                                                               random_state=88888888)),
                                                                       ('svm',
                                                                        SVC(class_weight='b...
                                                           final_estimator=LogisticRegression(max_iter=5000,
                                                                                              random_state=88888888,
                                                                                              solver='saga')))]),
             n_jobs=-1,
             param_grid={'stacked_model__dt__max_depth': [3, 5],
                         'stacked_model__final_estimator__class_weight': ['balanced'],
                         'stacked_model__final_estimator__penalty': ['l1', 'l2',
                                                                     None],
                         'stacked_model__rf__max_depth': [3, 5],
                         'stacked_model__svm__C': [0.5, 1.0]},
             scoring='f1', verbose=1)
Pipeline(steps=[('stacked_model',
                 StackingClassifier(estimators=[('dt',
                                                 DecisionTreeClassifier(class_weight='balanced',
                                                                        criterion='entropy',
                                                                        max_depth=3,
                                                                        min_samples_leaf=3,
                                                                        random_state=88888888)),
                                                ('rf',
                                                 RandomForestClassifier(class_weight='balanced',
                                                                        criterion='entropy',
                                                                        max_depth=3,
                                                                        min_samples_leaf=3,
                                                                        random_state=88888888)),
                                                ('svm',
                                                 SVC(class_weight='balanced',
                                                     kernel='linear',
                                                     probability=True,
                                                     random_state=88888888))],
                                    final_estimator=LogisticRegression(class_weight='balanced',
                                                                       max_iter=5000,
                                                                       penalty=None,
                                                                       random_state=88888888,
                                                                       solver='saga')))])
StackingClassifier(estimators=[('dt',
                                DecisionTreeClassifier(class_weight='balanced',
                                                       criterion='entropy',
                                                       max_depth=3,
                                                       min_samples_leaf=3,
                                                       random_state=88888888)),
                               ('rf',
                                RandomForestClassifier(class_weight='balanced',
                                                       criterion='entropy',
                                                       max_depth=3,
                                                       min_samples_leaf=3,
                                                       random_state=88888888)),
                               ('svm',
                                SVC(class_weight='balanced', kernel='linear',
                                    probability=True, random_state=88888888))],
                   final_estimator=LogisticRegression(class_weight='balanced',
                                                      max_iter=5000,
                                                      penalty=None,
                                                      random_state=88888888,
                                                      solver='saga'))
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, min_samples_leaf=3, random_state=88888888)
RandomForestClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, min_samples_leaf=3, random_state=88888888)
SVC(class_weight='balanced', kernel='linear', probability=True,
    random_state=88888888)
LogisticRegression(class_weight='balanced', max_iter=5000, penalty=None,
                   random_state=88888888, solver='saga')
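The fit log above reports 24 candidates and 120 fits. As a quick check, these counts can be reproduced from the parameter grid shown in the output. This is a minimal sketch using only the standard library; it enumerates the combinations and does not refit any model, and the variable names are illustrative.

```python
from itertools import product

# Parameter grid copied from the GridSearchCV output above
param_grid = {
    'stacked_model__dt__max_depth': [3, 5],
    'stacked_model__final_estimator__class_weight': ['balanced'],
    'stacked_model__final_estimator__penalty': ['l1', 'l2', None],
    'stacked_model__rf__max_depth': [3, 5],
    'stacked_model__svm__C': [0.5, 1.0],
}
cv_folds = 5

# Every combination of hyperparameter values is one candidate model
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold
n_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}")
print(f"Total cross-validation fits: {n_fits}")
```

The total counts only the cross-validation fits; the final refit of the best candidate on the full downsampled training data is not included in it.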
In [190]:
##################################
# Identifying the best model
##################################
stacked_unbalanced_class_best_model_downsampled = stacked_unbalanced_class_grid_search.best_estimator_
In [191]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_unbalanced_class_best_model_downsampled_f1_cv = stacked_unbalanced_class_grid_search.best_score_
stacked_unbalanced_class_best_model_downsampled_f1_train_cnn = f1_score(y_train_cnn, stacked_unbalanced_class_best_model_downsampled.predict(X_train_cnn))
stacked_unbalanced_class_best_model_downsampled_f1_validation = f1_score(y_validation, stacked_unbalanced_class_best_model_downsampled.predict(X_validation))
In [192]:
##################################
# Identifying the optimal model
##################################
print('Best Stacked Model using the CNN-Downsampled Train Data: ')
print(f"Best Stacked Model Parameters: {stacked_unbalanced_class_grid_search.best_params_}")
Best Stacked Model using the CNN-Downsampled Train Data: 
Best Stacked Model Parameters: {'stacked_model__dt__max_depth': 3, 'stacked_model__final_estimator__class_weight': 'balanced', 'stacked_model__final_estimator__penalty': None, 'stacked_model__rf__max_depth': 3, 'stacked_model__svm__C': 1.0}
In [193]:
##################################
# Summarizing the F1 score results
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_unbalanced_class_best_model_downsampled_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_unbalanced_class_best_model_downsampled_f1_train_cnn:.4f}")
print("\nClassification Report on Training Data:\n", classification_report(y_train_cnn, stacked_unbalanced_class_best_model_downsampled.predict(X_train_cnn)))
F1 Score on Cross-Validated Data: 0.7531
F1 Score on Training Data: 0.8219

Classification Report on Training Data:
               precision    recall  f1-score   support

           0       0.67      0.82      0.73        22
           1       0.88      0.77      0.82        39

    accuracy                           0.79        61
   macro avg       0.77      0.79      0.78        61
weighted avg       0.80      0.79      0.79        61

In [194]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the training data
##################################
cm_raw = confusion_matrix(y_train_cnn, stacked_unbalanced_class_best_model_downsampled.predict(X_train_cnn))
cm_normalized = confusion_matrix(y_train_cnn, stacked_unbalanced_class_best_model_downsampled.predict(X_train_cnn), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Stacked Model on Training Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Stacked Model on Training Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
Figure: Confusion matrices (raw count and normalized) of the best stacked model on the training data.
In [195]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
##################################
print(f"F1 Score on Validation Data: {stacked_unbalanced_class_best_model_downsampled_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation, stacked_unbalanced_class_best_model_downsampled.predict(X_validation)))
F1 Score on Validation Data: 0.9524

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.75      0.43      0.55         7
           1       0.93      0.98      0.95        51

    accuracy                           0.91        58
   macro avg       0.84      0.70      0.75        58
weighted avg       0.90      0.91      0.90        58
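The reported F1 scores can be cross-checked against the two classification reports. The sketch below assumes confusion-matrix counts inferred from the rounded precision, recall, and support values in those reports (the counts themselves are not printed in this section), and recomputes the F1 score of the positive class (LUNG_CANCER=YES). The function and variable names are illustrative.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score of the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts inferred from the classification reports (assumption):
# training data (CNN-downsampled) has 39 positive and 22 negative cases
train_counts = {'tp': 30, 'fp': 4, 'fn': 9}
# validation data has 51 positive and 7 negative cases
validation_counts = {'tp': 50, 'fp': 4, 'fn': 1}

f1_train = f1_from_counts(**train_counts)
f1_validation = f1_from_counts(**validation_counts)

print(f"Training F1:   {f1_train:.4f}")
print(f"Validation F1: {f1_validation:.4f}")
```

One reading of these numbers is that the positive-class F1 on the validation data stays high even though recall on the negative class is only 0.43, because the validation set keeps the original class distribution (51 of 58 cases are positive).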

In [196]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation, stacked_unbalanced_class_best_model_downsampled.predict(X_validation))
cm_normalized = confusion_matrix(y_validation, stacked_unbalanced_class_best_model_downsampled.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Confusion Matrix (Raw Count): Best Stacked Model on Validation Data')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Confusion Matrix (Normalized): Best Stacked Model on Validation Data')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
Figure: Confusion matrices (raw count and normalized) of the best stacked model on the validation data.
In [197]:
##################################
# Obtaining the logit values (log-odds)
# from the decision function for training data
##################################
stacked_unbalanced_class_best_model_downsampled_logit_values = stacked_unbalanced_class_best_model_downsampled.decision_function(X_train_cnn)
In [198]:
##################################
# Obtaining the estimated probabilities 
# for the positive class (LUNG_CANCER=YES) for training data
##################################
stacked_unbalanced_class_best_model_downsampled_probabilities = stacked_unbalanced_class_best_model_downsampled.predict_proba(X_train_cnn)[:, 1]
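For a binary logistic regression meta-learner, the two quantities obtained above are linked by the logistic (sigmoid) function, so a logit of 0 corresponds to the 0.5 classification threshold. A minimal standard-library sketch of that relationship follows; the logit values are illustrative and are not taken from the fitted model.

```python
import math

def sigmoid(logit):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

def log_odds(probability):
    """Inverse mapping: probability back to log-odds."""
    return math.log(probability / (1.0 - probability))

# Illustrative logit values (hypothetical, not model output)
example_logits = [-4.0, -1.0, 0.0, 1.0, 4.0]
example_probabilities = [sigmoid(z) for z in example_logits]

for z, p in zip(example_logits, example_probabilities):
    # A case is labelled positive only when its probability exceeds 0.5
    label = 'YES' if p > 0.5 else 'NO'
    print(f"logit={z:+.1f}  probability={p:.4f}  predicted LUNG_CANCER={label}")
```

Sorting by logit before plotting, as done in the next cell, traces this same S-shaped curve through the training cases.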
In [199]:
##################################
# Sorting the values to generate
# a smoother curve
##################################
stacked_unbalanced_class_best_model_downsampled_sorted_indices = np.argsort(stacked_unbalanced_class_best_model_downsampled_logit_values)
stacked_unbalanced_class_best_model_downsampled_logit_values_sorted = stacked_unbalanced_class_best_model_downsampled_logit_values[stacked_unbalanced_class_best_model_downsampled_sorted_indices]
stacked_unbalanced_class_best_model_downsampled_probabilities_sorted = stacked_unbalanced_class_best_model_downsampled_probabilities[stacked_unbalanced_class_best_model_downsampled_sorted_indices]
In [200]:
##################################
# Plotting the estimated logistic curve
# using the logit values
# and estimated probabilities
# obtained from the training data
##################################
plt.figure(figsize=(17, 8))
plt.plot(stacked_unbalanced_class_best_model_downsampled_logit_values_sorted, 
         stacked_unbalanced_class_best_model_downsampled_probabilities_sorted, label='Classification Model Logistic Curve', color='black')
plt.ylim(-0.05, 1.05)
plt.xlim(-8.00, 8.00)
target_0_indices = y_train_cnn == 0
target_1_indices = y_train_cnn == 1
plt.scatter(stacked_unbalanced_class_best_model_downsampled_logit_values[target_0_indices], 
            stacked_unbalanced_class_best_model_downsampled_probabilities[target_0_indices], 
            color='blue', alpha=0.40, s=100, marker= 'o', edgecolor='k', label='LUNG_CANCER=NO')
plt.scatter(stacked_unbalanced_class_best_model_downsampled_logit_values[target_1_indices], 
            stacked_unbalanced_class_best_model_downsampled_probabilities[target_1_indices], 
            color='red', alpha=0.40, s=100, marker='o', edgecolor='k', label='LUNG_CANCER=YES')
plt.axhline(0.5, color='green', linestyle='--', label='Classification Model Threshold')
plt.title('Logistic Curve (Downsampled Training Data): Stacked Model')
plt.xlabel('Logit (Log-Odds)')
plt.ylabel('Estimated Lung Cancer Probability')
plt.grid(True)
plt.legend(loc='upper left')
plt.show()
Figure: Logistic Curve (Downsampled Training Data): Stacked Model, showing the estimated lung cancer probability against the logit (log-odds).
In [201]:
##################################
# Saving the best stacked model
# developed from the downsampled training data
################################## 
joblib.dump(stacked_unbalanced_class_best_model_downsampled, 
            os.path.join("..", MODELS_PATH, "stacked_unbalanced_class_best_model_downsampled.pkl"))
Out[201]:
['..\\models\\stacked_unbalanced_class_best_model_downsampled.pkl']

1.6.7 Model Selection ¶

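  1. The stacked classifier developed from the train data (SMOTE-upsampled) was selected as the final model because it combined a high validation F1 score with minimal overfitting. The individual classifier developed from the train data (CNN-downsampled) scored higher on the validation data (0.9709) but showed a large gap between its apparent (0.8533) and cross-validated (0.7537) F1 scores. The F1 scores of the selected model were as follows: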
    • train data (SMOTE-upsampled) = 0.9568
    • train data (cross-validated) = 0.9488
    • validation data = 0.9615
  2. The final model configuration is described as follows:
    • Base learner: decision tree model with optimal hyperparameters:
      • max_depth = 3
      • class_weight = none
      • criterion = entropy
      • min_samples_leaf = 3
      • random_state = 88888888
    • Base learner: random forest model with optimal hyperparameters:
      • max_depth = 5
      • class_weight = none
      • criterion = entropy
      • max_features = sqrt
      • min_samples_leaf = 3
      • random_state = 88888888
    • Base learner: support vector machine model with optimal hyperparameters:
      • C = 1.00
      • class_weight = none
      • kernel = linear
      • probability = true
      • random_state = 88888888
    • Meta-learner: logistic regression model with optimal hyperparameters:
      • penalty = none
      • class_weight = none
      • solver = saga
      • max_iter = 500
      • random_state = 88888888
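The selection logic described above can be sketched as a simple rule: discard candidates whose apparent F1 exceeds the cross-validated F1 by more than a tolerance, then pick the highest validation F1 among the rest. The scores are the values reported in this section; the 0.05 tolerance and all names are assumptions made for illustration, not settings used in the original analysis.

```python
# (train F1, cross-validated F1, validation F1) as reported in this section
f1_scores = {
    'INDIVIDUAL_ORIGINAL_TRAIN':    (0.930556, 0.911574, 0.949495),
    'STACKED_ORIGINAL_TRAIN':       (0.940351, 0.912498, 0.914894),
    'INDIVIDUAL_UPSAMPLED_TRAIN':   (0.912162, 0.910870, 0.927835),
    'STACKED_UPSAMPLED_TRAIN':      (0.956811, 0.948878, 0.961538),
    'INDIVIDUAL_DOWNSAMPLED_TRAIN': (0.853333, 0.753711, 0.970874),
    'STACKED_DOWNSAMPLED_TRAIN':    (0.821918, 0.753114, 0.952381),
}

MAX_OPTIMISM = 0.05  # hypothetical tolerance for the train vs. CV gap

# Keep candidates whose train-to-CV gap stays within the tolerance
eligible = {
    name: scores
    for name, scores in f1_scores.items()
    if scores[0] - scores[1] <= MAX_OPTIMISM
}

# Among the eligible candidates, choose the highest validation F1
selected = max(eligible, key=lambda name: eligible[name][2])

for name, (train, cv, validation) in f1_scores.items():
    print(f"{name:<30} gap={train - cv:.4f}  validation={validation:.4f}")
print(f"Selected: {selected}")
```

Under this rule both downsampled candidates are excluded for their large train-to-CV gap, even though the individual downsampled classifier has the highest raw validation F1.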
In [202]:
##################################
# Gathering the F1 scores from 
# training, cross-validation and validation
##################################
set_labels = ['Train','Cross-Validation','Validation']
f1_plot = pd.DataFrame({'INDIVIDUAL_ORIGINAL_TRAIN': list([individual_unbalanced_class_best_model_original_f1_train,
                                                           individual_unbalanced_class_best_model_original_f1_cv,
                                                           individual_unbalanced_class_best_model_original_f1_validation]),
                        'STACKED_ORIGINAL_TRAIN': list([stacked_unbalanced_class_best_model_original_f1_train,
                                                        stacked_unbalanced_class_best_model_original_f1_cv,
                                                        stacked_unbalanced_class_best_model_original_f1_validation]),
                        'INDIVIDUAL_UPSAMPLED_TRAIN': list([individual_balanced_class_best_model_upsampled_f1_train_smote,
                                                           individual_balanced_class_best_model_upsampled_f1_cv,
                                                           individual_balanced_class_best_model_upsampled_f1_validation]),
                        'STACKED_UPSAMPLED_TRAIN': list([stacked_balanced_class_best_model_upsampled_f1_train_smote,
                                                        stacked_balanced_class_best_model_upsampled_f1_cv,
                                                        stacked_balanced_class_best_model_upsampled_f1_validation]),
                        'INDIVIDUAL_DOWNSAMPLED_TRAIN': list([individual_unbalanced_class_best_model_downsampled_f1_train_cnn,
                                                              individual_unbalanced_class_best_model_downsampled_f1_cv,
                                                              individual_unbalanced_class_best_model_downsampled_f1_validation]),
                        'STACKED_DOWNSAMPLED_TRAIN': list([stacked_unbalanced_class_best_model_downsampled_f1_train_cnn,
                                                           stacked_unbalanced_class_best_model_downsampled_f1_cv,
                                                           stacked_unbalanced_class_best_model_downsampled_f1_validation])},
                       index = set_labels)
display(f1_plot)
                  INDIVIDUAL_ORIGINAL_TRAIN  STACKED_ORIGINAL_TRAIN  INDIVIDUAL_UPSAMPLED_TRAIN  STACKED_UPSAMPLED_TRAIN  INDIVIDUAL_DOWNSAMPLED_TRAIN  STACKED_DOWNSAMPLED_TRAIN
Train             0.930556                   0.940351                0.912162                    0.956811                 0.853333                      0.821918
Cross-Validation  0.911574                   0.912498                0.910870                    0.948878                 0.753711                      0.753114
Validation        0.949495                   0.914894                0.927835                    0.961538                 0.970874                      0.952381
In [203]:
##################################
# Plotting all the F1 scores
# for all models
##################################
f1_plot = f1_plot.plot.barh(figsize=(10, 6), width=0.90)
f1_plot.set_xlim(0.00,1.00)
f1_plot.set_title("Classification Model Comparison by F1 Score")
f1_plot.set_xlabel("F1 Score")
f1_plot.set_ylabel("Data Set")
f1_plot.grid(False)
f1_plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in f1_plot.containers:
    f1_plot.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
Figure: Classification Model Comparison by F1 Score (horizontal bars per data set: Train, Cross-Validation, Validation).

1.6.8 Model Testing ¶

  1. The selected stacked classifier developed from the train data (SMOTE-upsampled) also demonstrated a high F1 score on the independent test dataset:
    • train data (SMOTE-upsampled) = 0.9568
    • train data (cross-validated) = 0.9488
    • validation data = 0.9615
    • test data = 0.9489
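To put the test result in context, the drop from validation to test F1 can be compared across all six candidates. The values below are those reported in the comparison tables of this document; the variable names are illustrative, and the sketch uses only the standard library.

```python
# (validation F1, test F1) for each candidate, as reported in this document
validation_and_test_f1 = {
    'INDIVIDUAL_ORIGINAL_TRAIN':    (0.949495, 0.904762),
    'STACKED_ORIGINAL_TRAIN':       (0.914894, 0.878049),
    'INDIVIDUAL_UPSAMPLED_TRAIN':   (0.927835, 0.890625),
    'STACKED_UPSAMPLED_TRAIN':      (0.961538, 0.948905),
    'INDIVIDUAL_DOWNSAMPLED_TRAIN': (0.970874, 0.939394),
    'STACKED_DOWNSAMPLED_TRAIN':    (0.952381, 0.916031),
}

# Drop in F1 when moving from the validation set to the test set
drops = {
    name: validation - test
    for name, (validation, test) in validation_and_test_f1.items()
}

best_on_test = max(validation_and_test_f1,
                   key=lambda name: validation_and_test_f1[name][1])
smallest_drop = min(drops, key=drops.get)

for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name:<30} validation-to-test drop = {drop:.4f}")
print(f"Highest test F1: {best_on_test}")
print(f"Smallest drop:   {smallest_drop}")
```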
In [204]:
##################################
# Evaluating the F1 scores
# on the test data
##################################
individual_unbalanced_class_best_model_original_f1_test = f1_score(y_test, individual_unbalanced_class_best_model_original.predict(X_test))
stacked_unbalanced_class_best_model_original_f1_test = f1_score(y_test, stacked_unbalanced_class_best_model_original.predict(X_test))
individual_balanced_class_best_model_upsampled_f1_test = f1_score(y_test, individual_balanced_class_best_model_upsampled.predict(X_test))
stacked_balanced_class_best_model_upsampled_f1_test = f1_score(y_test, stacked_balanced_class_best_model_upsampled.predict(X_test))
individual_unbalanced_class_best_model_downsampled_f1_test = f1_score(y_test, individual_unbalanced_class_best_model_downsampled.predict(X_test))
stacked_unbalanced_class_best_model_downsampled_f1_test = f1_score(y_test, stacked_unbalanced_class_best_model_downsampled.predict(X_test))
In [205]:
##################################
# Adding the F1 score estimated
# from the test data
##################################
set_labels = ['Train','Cross-Validation','Validation','Test']
updated_f1_plot = pd.DataFrame({'INDIVIDUAL_ORIGINAL_TRAIN': list([individual_unbalanced_class_best_model_original_f1_train,
                                                                   individual_unbalanced_class_best_model_original_f1_cv,
                                                                   individual_unbalanced_class_best_model_original_f1_validation,
                                                                   individual_unbalanced_class_best_model_original_f1_test]),
                                'STACKED_ORIGINAL_TRAIN': list([stacked_unbalanced_class_best_model_original_f1_train,
                                                                stacked_unbalanced_class_best_model_original_f1_cv,
                                                                stacked_unbalanced_class_best_model_original_f1_validation,
                                                               stacked_unbalanced_class_best_model_original_f1_test]),
                                'INDIVIDUAL_UPSAMPLED_TRAIN': list([individual_balanced_class_best_model_upsampled_f1_train_smote,
                                                                    individual_balanced_class_best_model_upsampled_f1_cv,
                                                                    individual_balanced_class_best_model_upsampled_f1_validation,
                                                                   individual_balanced_class_best_model_upsampled_f1_test]),
                                'STACKED_UPSAMPLED_TRAIN': list([stacked_balanced_class_best_model_upsampled_f1_train_smote,
                                                                 stacked_balanced_class_best_model_upsampled_f1_cv,
                                                                 stacked_balanced_class_best_model_upsampled_f1_validation,
                                                                stacked_balanced_class_best_model_upsampled_f1_test]),
                                'INDIVIDUAL_DOWNSAMPLED_TRAIN': list([individual_unbalanced_class_best_model_downsampled_f1_train_cnn,
                                                                      individual_unbalanced_class_best_model_downsampled_f1_cv,
                                                                      individual_unbalanced_class_best_model_downsampled_f1_validation,
                                                                      individual_unbalanced_class_best_model_downsampled_f1_test]),
                                'STACKED_DOWNSAMPLED_TRAIN': list([stacked_unbalanced_class_best_model_downsampled_f1_train_cnn,
                                                                   stacked_unbalanced_class_best_model_downsampled_f1_cv,
                                                                   stacked_unbalanced_class_best_model_downsampled_f1_validation,
                                                                  stacked_unbalanced_class_best_model_downsampled_f1_test])},
                               index = set_labels)
display(updated_f1_plot)
                  INDIVIDUAL_ORIGINAL_TRAIN  STACKED_ORIGINAL_TRAIN  INDIVIDUAL_UPSAMPLED_TRAIN  STACKED_UPSAMPLED_TRAIN  INDIVIDUAL_DOWNSAMPLED_TRAIN  STACKED_DOWNSAMPLED_TRAIN
Train             0.930556                   0.940351                0.912162                    0.956811                 0.853333                      0.821918
Cross-Validation  0.911574                   0.912498                0.910870                    0.948878                 0.753711                      0.753114
Validation        0.949495                   0.914894                0.927835                    0.961538                 0.970874                      0.952381
Test              0.904762                   0.878049                0.890625                    0.948905                 0.939394                      0.916031
In [206]:
##################################
# Plotting all the F1 scores
# for all models
##################################
updated_f1_plot = updated_f1_plot.plot.barh(figsize=(10, 8), width=0.90)
updated_f1_plot.set_xlim(0.00,1.00)
updated_f1_plot.set_title("Classification Model Comparison by F1 Score")
updated_f1_plot.set_xlabel("F1 Score")
updated_f1_plot.set_ylabel("Data Set")
updated_f1_plot.grid(False)
updated_f1_plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in updated_f1_plot.containers:
    updated_f1_plot.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
Figure: Classification Model Comparison by F1 Score (horizontal bars per data set: Train, Cross-Validation, Validation, Test).

1.6.9 Model Inference ¶

  1. For the final selected stacked classifier developed from the train data (SMOTE-upsampled), the contributions of the base learners, ranked by importance, are given as follows:
    • Base learner: random forest model
    • Base learner: decision tree model
    • Base learner: support vector machine model
  2. For each base learner of the final selected stacked classifier developed from the train data (SMOTE-upsampled), the contributions of the predictors, ranked by importance, are given as follows:
    • Base learner: random forest model
      • ALLERGY
      • ALCOHOL_CONSUMING
      • PEER_PRESSURE
      • ANXIETY
      • FATIGUE
      • WHEEZING
      • SWALLOWING_DIFFICULTY
      • COUGHING
      • CHEST_PAIN
      • YELLOW_FINGERS
    • Base learner: decision tree model
      • ALLERGY
      • PEER_PRESSURE
      • ALCOHOL_CONSUMING
      • YELLOW_FINGERS
    • Base learner: support vector machine model
      • ALLERGY
      • PEER_PRESSURE
      • ANXIETY
      • FATIGUE
      • SWALLOWING_DIFFICULTY
      • WHEEZING
      • ALCOHOL_CONSUMING
      • COUGHING
      • CHEST_PAIN
      • YELLOW_FINGERS
  3. Model inference involved describing the characteristics of a new case and predicting its lung cancer probability in the context of the model training observations:
    • Characteristics based on all predictors used for generating the final selected stacked classifier
    • Predicted lung cancer probability based on the final selected stacked classifier logistic curve
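How the stacked classifier turns base-learner outputs into a final probability can be sketched as follows: each base learner produces a positive-class probability, and the logistic regression meta-learner applies a weighted sum followed by the sigmoid function. The coefficients, intercept, and base-learner probabilities below are hypothetical placeholders chosen for illustration; they are not the fitted values of the final model.

```python
import math

def stacked_probability(base_probabilities, coefficients, intercept):
    """Combine base-learner probabilities with a logistic regression meta-learner.

    Returns the positive-class probability and the underlying logit.
    """
    logit = intercept + sum(
        weight * probability
        for weight, probability in zip(coefficients, base_probabilities)
    )
    return 1.0 / (1.0 + math.exp(-logit)), logit

# Hypothetical meta-learner parameters
# (order: decision tree, random forest, support vector machine)
coefficients = [1.5, 3.0, 1.0]
intercept = -2.5

# Hypothetical positive-class probabilities from the base learners
low_risk_case = [0.10, 0.20, 0.15]
high_risk_case = [0.90, 0.85, 0.95]

low_probability, low_logit = stacked_probability(low_risk_case, coefficients, intercept)
high_probability, high_logit = stacked_probability(high_risk_case, coefficients, intercept)

print(f"Low-risk case:  logit={low_logit:+.2f}  probability={low_probability:.3f}")
print(f"High-risk case: logit={high_logit:+.2f}  probability={high_probability:.3f}")
```

In this sketch a larger absolute coefficient means the corresponding base learner moves the final logit more, which is the sense in which the base learners are ranked by importance.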
In [207]:
##################################
# Assigning as the final model
# the candidate model selected
# on the validation data, which also
# demonstrated the best performance
# on the test set
##################################
final_model = stacked_balanced_class_best_model_upsampled.named_steps['stacked_model']
final_model_base_learner = ['Stacked Model Base Learner: Decision Trees',
                            'Stacked Model Base Learner: Random Forest',
                            'Stacked Model Base Learner: Support Vector Machine']
In [208]:
##################################
# Defining a function to compute and plot 
# the feature importance for a defined model
##################################
def plot_feature_importance(importance, feature_names, model_name):
    indices = np.argsort(importance)
    plt.figure(figsize=(17, 8))
    plt.title(f"Feature Importance - {model_name}")
    plt.barh(range(len(importance)), importance[indices], align="center")
    plt.yticks(range(len(importance)), [feature_names[i] for i in indices])
    plt.tight_layout()
    plt.show()
In [209]:
##################################
# Defining the predictor names
##################################
feature_names = X_test.columns
In [210]:
##################################
# Ranking the predictors based on model importance
# for each base learner using feature importance
# for tree-based models like DecisionTree and Random Forest
# and coefficients for linear models like SVC with linear kernel
##################################
for index, (name, model) in enumerate(final_model.named_estimators_.items()):
    if hasattr(model, 'feature_importances_'):  # For tree-based models like DecisionTree and RandomForest
        plot_feature_importance(model.feature_importances_, feature_names, model_name=final_model_base_learner[index])
    elif hasattr(model, 'coef_'):  # For linear models like SVC with linear kernel
        importance = np.abs(model.coef_).flatten()
        plot_feature_importance(importance, feature_names, model_name=final_model_base_learner[index])
Figure: Feature Importance plots for the three base learners of the stacked model (Decision Trees, Random Forest, Support Vector Machine).
In [211]:
##################################
# Generating predictions from the 
# base learners to be used as input
# to the logistic regression meta-learner
##################################
base_learners_predictions = []
for name, model in final_model.named_estimators_.items():
    base_learners_predictions.append(model.predict_proba(X_test)[:, 1])
In [212]:
##################################
# Stacking the base learners' predictions
##################################
meta_input = np.column_stack(base_learners_predictions)

##################################
# Defining the base learner model names
##################################
meta_feature_names = [f'Model Prediction - {x}' for x in final_model_base_learner]

##################################
# Ranking the predictors based on model importance
# for the meta-learner using coefficients
# for linear models like logistic regression
##################################
if hasattr(final_model.final_estimator_, 'coef_'):
    importance = np.abs(final_model.final_estimator_.coef_).flatten()
    plot_feature_importance(importance, meta_feature_names, model_name='Stacked Model Meta-Learner: Logistic Regression')
Figure: Feature Importance - Stacked Model Meta-Learner: Logistic Regression.
In [213]:
##################################
# Rebuilding the upsampled training data
# for plotting categorical distributions
##################################
lung_cancer_train_smote = pd.concat([X_train_smote, y_train_smote], axis=1)
lung_cancer_train_smote.iloc[:,0:10] = lung_cancer_train_smote.iloc[:,0:10].replace({0: 'Absent', 1: 'Present'})
lung_cancer_train_smote['LUNG_CANCER'] = lung_cancer_train_smote['LUNG_CANCER'].astype('category')
lung_cancer_train_smote['LUNG_CANCER'] = lung_cancer_train_smote['LUNG_CANCER'].cat.rename_categories({0: 'No', 1: 'Yes'})
lung_cancer_train_smote[lung_cancer_train_smote.columns[0:11]] = lung_cancer_train_smote[lung_cancer_train_smote.columns[0:11]].astype('category')
lung_cancer_train_smote.head()
Out[213]:
   YELLOW_FINGERS  ANXIETY  PEER_PRESSURE  FATIGUE  ALLERGY  WHEEZING  ALCOHOL_CONSUMING  COUGHING  SWALLOWING_DIFFICULTY  CHEST_PAIN  LUNG_CANCER
0  Absent          Absent   Present        Present  Present  Present   Present            Absent    Present                Absent      Yes
1  Present         Present  Absent         Absent   Present  Present   Present            Present   Present                Present     Yes
2  Present         Present  Present        Present  Absent   Present   Absent             Present   Present                Absent      Yes
3  Absent          Absent   Absent         Present  Present  Present   Present            Absent    Present                Present     Yes
4  Present         Present  Present        Present  Absent   Absent    Absent             Absent    Present                Absent      Yes
In [214]:
##################################
# Plotting the categorical distributions
# of the upsampled training data
##################################
fig, axs = plt.subplots(2, 5, figsize=(17, 8))

colors = ['blue','red']
level_order = ['Absent','Present']
predictors = ['YELLOW_FINGERS', 'ANXIETY', 'PEER_PRESSURE', 'FATIGUE', 'ALLERGY',
              'WHEEZING', 'ALCOHOL_CONSUMING', 'COUGHING', 'SWALLOWING_DIFFICULTY', 'CHEST_PAIN']

# Drawing one count plot per predictor, filling the 2x5 grid row by row
for ax, predictor in zip(axs.flatten(), predictors):
    sns.countplot(x=predictor, hue='LUNG_CANCER', data=lung_cancer_train_smote, ax=ax, order=level_order, palette=colors)
    ax.set_title(predictor)
    ax.set_ylabel('Classification Model Training Case Count')
    ax.set_xlabel(None)
    ax.set_ylim(0, 200)
    ax.legend(title='LUNG_CANCER', loc='upper center')
    # Recoloring the bars by LUNG_CANCER level and making them translucent
    for patch, color in zip(ax.patches, ['blue','blue','red','red']):
        patch.set_facecolor(color)
        patch.set_alpha(0.2)

plt.tight_layout()
plt.show()
In [215]:
##################################
# Plotting the estimated logistic curve
# of the final classification model
# involving a stacked model with
# a logistic regression meta-learner
# and random forest, SVC and decision tree
# base learners
##################################
plt.figure(figsize=(17, 8))
plt.plot(stacked_balanced_class_best_model_upsampled_logit_values_sorted, 
         stacked_balanced_class_best_model_upsampled_probabilities_sorted, label='Classification Model Logistic Curve', color='black')
plt.ylim(-0.05, 1.05)
plt.xlim(-6.00, 6.00)
target_0_indices = y_train_smote == 0
target_1_indices = y_train_smote == 1
plt.scatter(stacked_balanced_class_best_model_upsampled_logit_values[target_0_indices], 
            stacked_balanced_class_best_model_upsampled_probabilities[target_0_indices], 
            color='blue', alpha=0.40, s=100, marker= 'o', edgecolor='k', label='LUNG_CANCER=NO')
plt.scatter(stacked_balanced_class_best_model_upsampled_logit_values[target_1_indices], 
            stacked_balanced_class_best_model_upsampled_probabilities[target_1_indices], 
            color='red', alpha=0.40, s=100, marker='o', edgecolor='k', label='LUNG_CANCER=YES')
plt.axhline(0.5, color='green', linestyle='--', label='Classification Model Threshold')
plt.title('Final Classification Model: Stacked Model (Meta-Learner = Logistic Regression, Base Learners = Random Forest, Support Vector Classifier, Decision Tree)')
plt.xlabel('Logit (Log-Odds)')
plt.ylabel('Estimated Lung Cancer Probability')
plt.grid(True)
plt.legend(loc='upper left')
plt.show()
In [216]:
##################################
# Describing the details of a 
# low-risk test case
##################################
X_sample = {"YELLOW_FINGERS":1,
            "ANXIETY":0,
            "PEER_PRESSURE":0,
            "FATIGUE":0,
            "ALLERGY":0,
            "WHEEZING":1,
            "ALCOHOL_CONSUMING":0,
            "COUGHING":0,
            "SWALLOWING_DIFFICULTY":1,
            "CHEST_PAIN":1}
X_test_sample = pd.DataFrame([X_sample])
X_test_sample.head()
Out[216]:
YELLOW_FINGERS ANXIETY PEER_PRESSURE FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SWALLOWING_DIFFICULTY CHEST_PAIN
0 1 0 0 0 0 1 0 0 1 1
In [217]:
##################################
# Rebuilding the low-risk test case data
# for plotting categorical distributions
##################################
X_test_sample_category = X_test_sample.copy()
int_test_columns = X_test_sample_category.columns
X_test_sample_category[int_test_columns] = X_test_sample_category[int_test_columns].astype(object)
X_test_sample_category[int_test_columns] = X_test_sample_category[int_test_columns].replace({0: 'Absent', 1: 'Present'})
X_test_sample_category.head()
Out[217]:
YELLOW_FINGERS ANXIETY PEER_PRESSURE FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SWALLOWING_DIFFICULTY CHEST_PAIN
0 Present Absent Absent Absent Absent Present Absent Absent Present Present
In [218]:
##################################
# Plotting the categorical distributions
# for a low-risk test case
##################################
fig, axs = plt.subplots(2, 5, figsize=(17, 8))

colors = ['blue','red']
level_order = ['Absent','Present']
predictors = ['YELLOW_FINGERS', 'ANXIETY', 'PEER_PRESSURE', 'FATIGUE', 'ALLERGY',
              'WHEEZING', 'ALCOHOL_CONSUMING', 'COUGHING', 'SWALLOWING_DIFFICULTY', 'CHEST_PAIN']

# Drawing one count plot per predictor, filling the 2x5 grid row by row
for ax, predictor in zip(axs.flatten(), predictors):
    sns.countplot(x=predictor, hue='LUNG_CANCER', data=lung_cancer_train_smote, ax=ax, order=level_order, palette=colors)
    # Marking the level observed for the test case with a dashed vertical line
    ax.axvline(level_order.index(X_test_sample_category[predictor].iloc[0]), color='black', linestyle='--', linewidth=3)
    ax.set_title(predictor)
    ax.set_ylabel('Classification Model Training Case Count')
    ax.set_xlabel(None)
    ax.set_ylim(0, 200)
    ax.legend(title='LUNG_CANCER', loc='upper center')
    # Recoloring the bars by LUNG_CANCER level and making them translucent
    for patch, color in zip(ax.patches, ['blue','blue','red','red']):
        patch.set_facecolor(color)
        patch.set_alpha(0.2)

plt.tight_layout()
plt.show()
In [219]:
##################################
# Computing the logit and estimated probability
# for the low-risk test case
##################################
X_sample_logit = stacked_balanced_class_best_model_upsampled.decision_function(X_test_sample)[0]
X_sample_probability = stacked_balanced_class_best_model_upsampled.predict_proba(X_test_sample)[0, 1]
X_sample_class = "Low-Risk" if X_sample_probability < 0.50 else "High-Risk"
print(f"Test Case Risk Index: {X_sample_logit}")
print(f"Test Case Probability: {X_sample_probability}")
print(f"Test Case Risk Category: {X_sample_class}")
Test Case Risk Index: -1.2117837409390746
Test Case Probability: 0.22938559072691203
Test Case Risk Category: Low-Risk
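The printed probability can be cross-checked against the printed risk index. The sketch below assumes that the meta-learner's `predict_proba` applies the standard logistic function to the value returned by `decision_function`, which is the usual behaviour for a binary logistic regression. The `sigmoid` helper is illustrative and is not part of the notebook.

```python
import math

def sigmoid(logit):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Risk index printed for the low-risk test case above
low_risk_logit = -1.2117837409390746
low_risk_probability = sigmoid(low_risk_logit)
print(f"Probability recovered from the logit: {low_risk_probability:.6f}")

# A logit of zero sits exactly on the 0.50 classification threshold
threshold_probability = sigmoid(0.0)
print(f"Probability at a logit of zero: {threshold_probability}")
```

Because the logistic function is monotonic, classifying a case as low-risk when its probability is below 0.50 is equivalent to checking that its risk index is negative.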
In [220]:
##################################
# Plotting the logit and estimated probability
# for the low-risk test case 
# in the estimated logistic curve
# of the final classification model
##################################
plt.figure(figsize=(17, 8))
plt.plot(stacked_balanced_class_best_model_upsampled_logit_values_sorted, 
         stacked_balanced_class_best_model_upsampled_probabilities_sorted, label='Classification Model Logistic Curve', color='black')
plt.ylim(-0.05, 1.05)
plt.xlim(-6.00, 6.00)
target_0_indices = y_train_smote == 0
target_1_indices = y_train_smote == 1
plt.axhline(0.5, color='green', linestyle='--', label='Classification Model Threshold')
plt.scatter(stacked_balanced_class_best_model_upsampled_logit_values[target_0_indices], 
            stacked_balanced_class_best_model_upsampled_probabilities[target_0_indices], 
            color='blue', alpha=0.20, s=100, marker= 'o', edgecolor='k', label='Classification Model Training Cases: LUNG_CANCER = No')
plt.scatter(stacked_balanced_class_best_model_upsampled_logit_values[target_1_indices], 
            stacked_balanced_class_best_model_upsampled_probabilities[target_1_indices], 
            color='red', alpha=0.20, s=100, marker='o', edgecolor='k', label='Classification Model Training Cases: LUNG_CANCER = Yes')
# Coloring the test case marker by its predicted risk category
test_case_color = 'blue' if X_sample_class == "Low-Risk" else 'red'
plt.scatter(X_sample_logit, X_sample_probability, color=test_case_color, s=125, edgecolor='k', label=f'Test Case ({X_sample_class})', marker= 's', zorder=5)
plt.axvline(X_sample_logit, color='black', linestyle='--', linewidth=3)
plt.axhline(X_sample_probability, color='black', linestyle='--', linewidth=3)
plt.title('Final Classification Model: Stacked Model (Meta-Learner = Logistic Regression, Base Learners = Random Forest, Support Vector Classifier, Decision Tree)')
plt.xlabel('Logit (Log-Odds)')
plt.ylabel('Estimated Lung Cancer Probability')
plt.grid(False)
plt.legend(facecolor='white', framealpha=1, loc='upper center', bbox_to_anchor=(0.5, -0.10), ncol=3)
plt.tight_layout(rect=[0, 0, 1.00, 0.95])
plt.show()
In [221]:
##################################
# Describing the details of a 
# high-risk test case
##################################
X_sample = {"YELLOW_FINGERS":1,
            "ANXIETY":0,
            "PEER_PRESSURE":1,
            "FATIGUE":0,
            "ALLERGY":1,
            "WHEEZING":1,
            "ALCOHOL_CONSUMING":0,
            "COUGHING":1,
            "SWALLOWING_DIFFICULTY":1,
            "CHEST_PAIN":1}
X_test_sample = pd.DataFrame([X_sample])
X_test_sample.head()
Out[221]:
YELLOW_FINGERS ANXIETY PEER_PRESSURE FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SWALLOWING_DIFFICULTY CHEST_PAIN
0 1 0 1 0 1 1 0 1 1 1
In [222]:
##################################
# Rebuilding the high-risk test case data
# for plotting categorical distributions
##################################
X_test_sample_category = X_test_sample.copy()
int_test_columns = X_test_sample_category.columns
X_test_sample_category[int_test_columns] = X_test_sample_category[int_test_columns].astype(object)
X_test_sample_category[int_test_columns] = X_test_sample_category[int_test_columns].replace({0: 'Absent', 1: 'Present'})
X_test_sample_category.head()
Out[222]:
YELLOW_FINGERS ANXIETY PEER_PRESSURE FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SWALLOWING_DIFFICULTY CHEST_PAIN
0 Present Absent Present Absent Present Present Absent Present Present Present
In [223]:
##################################
# Plotting the categorical distributions
# for a high-risk test case
##################################
fig, axs = plt.subplots(2, 5, figsize=(17, 8))

colors = ['blue','red']
level_order = ['Absent','Present']
predictors = ['YELLOW_FINGERS', 'ANXIETY', 'PEER_PRESSURE', 'FATIGUE', 'ALLERGY',
              'WHEEZING', 'ALCOHOL_CONSUMING', 'COUGHING', 'SWALLOWING_DIFFICULTY', 'CHEST_PAIN']

# Drawing one count plot per predictor, filling the 2x5 grid row by row
for ax, predictor in zip(axs.flatten(), predictors):
    sns.countplot(x=predictor, hue='LUNG_CANCER', data=lung_cancer_train_smote, ax=ax, order=level_order, palette=colors)
    # Marking the level observed for the test case with a dashed vertical line
    ax.axvline(level_order.index(X_test_sample_category[predictor].iloc[0]), color='black', linestyle='--', linewidth=3)
    ax.set_title(predictor)
    ax.set_ylabel('Classification Model Training Case Count')
    ax.set_xlabel(None)
    ax.set_ylim(0, 200)
    ax.legend(title='LUNG_CANCER', loc='upper center')
    # Recoloring the bars by LUNG_CANCER level and making them translucent
    for patch, color in zip(ax.patches, ['blue','blue','red','red']):
        patch.set_facecolor(color)
        patch.set_alpha(0.2)

plt.tight_layout()
plt.show()
In [224]:
##################################
# Computing the logit and estimated probability
# for a high-risk test case
##################################
X_sample_logit = stacked_balanced_class_best_model_upsampled.decision_function(X_test_sample)[0]
X_sample_probability = stacked_balanced_class_best_model_upsampled.predict_proba(X_test_sample)[0, 1]
X_sample_class = "Low-Risk" if X_sample_probability < 0.50 else "High-Risk"
print(f"Test Case Risk Index: {X_sample_logit}")
print(f"Test Case Probability: {X_sample_probability}")
print(f"Test Case Risk Category: {X_sample_class}")
Test Case Risk Index: 3.4784950973590973
Test Case Probability: 0.9700696569701589
Test Case Risk Category: High-Risk
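The high-risk result can also be read in terms of odds. The sketch below inverts the logistic function to recover the risk index from the printed probability, again assuming the standard logistic link between the two. The `logit_from_probability` helper is illustrative and is not part of the notebook.

```python
import math

def logit_from_probability(probability):
    """Invert the logistic function: probability -> log-odds."""
    return math.log(probability / (1.0 - probability))

# Probability printed for the high-risk test case above
high_risk_probability = 0.9700696569701589
high_risk_odds = high_risk_probability / (1.0 - high_risk_probability)
high_risk_logit = logit_from_probability(high_risk_probability)
print(f"Odds of LUNG_CANCER=Yes versus No: {high_risk_odds:.2f}")
print(f"Log-odds recovered from the probability: {high_risk_logit:.6f}")
```

The recovered log-odds should agree with the printed risk index up to floating-point error, which makes the risk index directly interpretable as the natural logarithm of the estimated odds.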
In [225]:
##################################
# Plotting the logit and estimated probability
# for the high-risk test case 
# in the estimated logistic curve
# of the final classification model
##################################
plt.figure(figsize=(17, 8))
plt.plot(stacked_balanced_class_best_model_upsampled_logit_values_sorted, 
         stacked_balanced_class_best_model_upsampled_probabilities_sorted, label='Classification Model Logistic Curve', color='black')
plt.ylim(-0.05, 1.05)
plt.xlim(-6.00, 6.00)
target_0_indices = y_train_smote == 0
target_1_indices = y_train_smote == 1
plt.axhline(0.5, color='green', linestyle='--', label='Classification Model Threshold')
plt.scatter(stacked_balanced_class_best_model_upsampled_logit_values[target_0_indices], 
            stacked_balanced_class_best_model_upsampled_probabilities[target_0_indices], 
            color='blue', alpha=0.20, s=100, marker= 'o', edgecolor='k', label='Classification Model Training Cases: LUNG_CANCER = No')
plt.scatter(stacked_balanced_class_best_model_upsampled_logit_values[target_1_indices], 
            stacked_balanced_class_best_model_upsampled_probabilities[target_1_indices], 
            color='red', alpha=0.20, s=100, marker='o', edgecolor='k', label='Classification Model Training Cases: LUNG_CANCER = Yes')
# Coloring the test case marker by its predicted risk category
test_case_color = 'blue' if X_sample_class == "Low-Risk" else 'red'
plt.scatter(X_sample_logit, X_sample_probability, color=test_case_color, s=125, edgecolor='k', label=f'Test Case ({X_sample_class})', marker= 's', zorder=5)
plt.axvline(X_sample_logit, color='black', linestyle='--', linewidth=3)
plt.axhline(X_sample_probability, color='black', linestyle='--', linewidth=3)
plt.title('Final Classification Model: Stacked Model (Meta-Learner = Logistic Regression, Base Learners = Random Forest, Support Vector Classifier, Decision Tree)')
plt.xlabel('Logit (Log-Odds)')
plt.ylabel('Estimated Lung Cancer Probability')
plt.grid(False)
plt.legend(facecolor='white', framealpha=1, loc='upper center', bbox_to_anchor=(0.5, -0.10), ncol=3)
plt.tight_layout(rect=[0, 0, 1.00, 0.95])
plt.show()

1.7 Predictive Model Deployment Using Streamlit and Streamlit Community Cloud

Streamlit is an open-source Python library that simplifies the creation and deployment of web applications for machine learning and data science projects. It allows developers and data scientists to turn Python scripts into interactive web apps quickly without requiring extensive web development knowledge. Streamlit seamlessly integrates with popular Python libraries such as Pandas, Matplotlib, Plotly, and TensorFlow, allowing one to leverage existing data processing and visualization tools within the application. Streamlit apps can be easily deployed on various platforms, including Streamlit Community Cloud, Heroku, or any cloud service that supports Python web applications.

Streamlit Community Cloud, formerly known as Streamlit Sharing, is a free cloud-based platform provided by Streamlit for deploying and sharing Streamlit apps online. It is particularly popular among data scientists, machine learning engineers, and developers for quickly showcasing projects, creating interactive demos, and sharing data-driven applications with a wider audience without needing to manage server infrastructure. Significant features include:

  • free hosting: apps are hosted at no cost, making the platform accessible to users who want to share their work
  • easy deployment: users connect a GitHub repository to Streamlit Community Cloud, and the app is deployed directly from that repository
  • continuous deployment: when the code in the connected GitHub repository is updated, the app is automatically redeployed with the latest changes
  • sharing capabilities: a deployed app is shared through a simple URL, so collaborators, stakeholders, or the general public can access and interact with it
  • built-in authentication: users can restrict access to their apps using GitHub-based authentication, controlling who can view and interact with the app
  • community support: a community of users and developers shares knowledge, templates, and best practices for building and deploying Streamlit apps

1.7.1 Model Prediction Application Code Development

  1. Model prediction application code was developed in Python to:
    • compute risk indices for the test case and the study population data as baseline
    • estimate lung cancer probabilities for the test case and the study population data as baseline
    • predict risk categories for the test case
  2. The model prediction application code was saved in a repository that was eventually cloned for uploading to Streamlit Community Cloud.
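The three prediction steps listed above can be sketched as a single function. This is a minimal illustration and not the deployed application code: `predict_case` and `StubModel` are hypothetical names, the stub's scoring rule is a placeholder, and the fitted model is only assumed to expose a scikit-learn style `decision_function` whose output maps to a probability through the standard logistic function.

```python
import math

def predict_case(model, feature_row, threshold=0.50):
    """Return the risk index, probability and risk category for one case.

    `model` is assumed to expose a scikit-learn style `decision_function`
    that accepts a list of rows and returns one logit per row.
    """
    risk_index = float(model.decision_function([feature_row])[0])
    probability = 1.0 / (1.0 + math.exp(-risk_index))
    category = "Low-Risk" if probability < threshold else "High-Risk"
    return {"risk_index": risk_index, "probability": probability, "category": category}

class StubModel:
    """Hypothetical stand-in for the fitted stacked classifier."""
    def decision_function(self, rows):
        # Placeholder scoring rule: each present indicator adds 0.8 to a -4.0 baseline
        return [-4.0 + 0.8 * sum(row) for row in rows]

result = predict_case(StubModel(), [1, 0, 0, 0, 0, 1, 0, 0, 1, 1])
print(result)
```

Applying the same function to every row of the study population would give the baseline risk indices and probabilities against which a test case is compared.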

ModelDeployment1_ModelPredictionApplicationCode.png

1.7.2 User Interface Application Code Development

  1. User interface application code was developed in Python to:
    • enable binary category selection (Present | Absent) to identify the status of the test case for each of the ten clinical symptoms and behavioral indicators
    • process study population data as baseline
    • process user input as test case
    • render all entries into visualization charts
    • execute all computations, estimations and predictions
    • render test case prediction into logistic probability plot
  2. The user interface application code was saved in a repository that was eventually cloned for uploading to Streamlit Community Cloud.
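One way to turn the Present | Absent selections into a model-ready test case is sketched below. The `encode_selections` helper and the `user_selections` dictionary are hypothetical. In a Streamlit app each value could come from a radio button, but the sketch deliberately avoids any Streamlit call so that it runs on its own.

```python
# Predictor order taken from the training data columns shown earlier
PREDICTORS = [
    "YELLOW_FINGERS", "ANXIETY", "PEER_PRESSURE", "FATIGUE", "ALLERGY",
    "WHEEZING", "ALCOHOL_CONSUMING", "COUGHING", "SWALLOWING_DIFFICULTY", "CHEST_PAIN",
]

def encode_selections(selections):
    """Convert {'PREDICTOR': 'Present' | 'Absent'} into an ordered 0/1 row."""
    encoding = {"Absent": 0, "Present": 1}
    missing = [name for name in PREDICTORS if name not in selections]
    if missing:
        raise ValueError(f"Missing selections for: {missing}")
    return [encoding[selections[name]] for name in PREDICTORS]

# Hypothetical user input matching the low-risk test case used earlier
user_selections = {name: "Absent" for name in PREDICTORS}
user_selections.update({"YELLOW_FINGERS": "Present", "WHEEZING": "Present",
                        "SWALLOWING_DIFFICULTY": "Present", "CHEST_PAIN": "Present"})
feature_row = encode_selections(user_selections)
print(feature_row)
```

Keeping the predictor order in one list guards against the columns of the test case drifting out of step with the columns the model was trained on.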

ModelDeployment1_UserInterfaceApplicationCode.png
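
A minimal sketch of how such an interface could be laid out with Streamlit is given below. It is an assumption-laden illustration rather than the application's actual code: the indicator names, the app title and the button label are hypothetical, and only three placeholder indicators are shown instead of the ten used in the study. The calls `st.title`, `st.radio`, `st.button` and `st.write` are standard Streamlit functions.

```python
# Hypothetical indicator names; the deployed app uses the ten clinical
# symptoms and behavioral indicators from the study.
INDICATORS = ["indicator_1", "indicator_2", "indicator_3"]
STATUS_OPTIONS = ("Absent", "Present")


def encode_responses(responses):
    """Map Present/Absent selections to the 1/0 inputs expected by the model."""
    return {name: int(status == "Present") for name, status in responses.items()}


def render_app():
    """Draw the radio buttons and the action button, then process the input."""
    # Imported here so that encode_responses can be used without Streamlit
    import streamlit as st

    st.title("Lung Cancer Probability Estimation")  # hypothetical title

    # One radio button group per indicator
    responses = {
        name: st.radio(name, STATUS_OPTIONS, horizontal=True)
        for name in INDICATORS
    }

    # All processing runs only after the action button is clicked
    if st.button("Assess"):  # hypothetical label
        test_case = encode_responses(responses)
        st.write("Encoded test case:", test_case)
        # Baseline processing, computations and plots would follow here.


# To launch the interface, add a call to render_app() at the end of the
# script and start it with: streamlit run <script name>
```

Separating `encode_responses` from the rendering code keeps the input handling testable without a running Streamlit session.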

1.7.3 Web Application ¶

  1. The prediction model was deployed as a web application hosted on Streamlit Community Cloud.
  2. The user interface input consists of the following:
    • radio buttons to:
      • enable binary category selection (Present | Absent) to identify the status of the test case for each of the ten clinical symptoms and behavioral indicators
    • action button to:
      • process study population data as baseline
      • process user input as test case
      • render all entries into visualization charts
      • execute all computations, estimations and predictions
      • render test case prediction into logistic probability plot
  3. The user interface output consists of the following:
    • count plots to:
      • provide a visualization of the proportion of lung cancer categories (Yes | No) by status (Present | Absent) as baseline
      • indicate the entries made from the user input to visually assess the test case characteristics against the study population
    • logistic curve plot to:
      • provide a visualization of the baseline logistic regression probability curve using the study population with lung cancer categories (Yes | No)
      • indicate the estimated risk index and lung cancer probability of the test case on the baseline logistic regression probability curve
    • summary table to:
      • present the computed risk index, estimated lung cancer probability and predicted risk category for the test case

ModelDeployment1_WebApplication.png
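
The data behind one of the count plots could be prepared along the following lines. This is a hedged sketch using only the Python standard library: the baseline records are invented for illustration, and the helper names are hypothetical rather than taken from the application.

```python
from collections import Counter


def count_by_status(statuses, outcomes):
    """Count baseline records for each (status, lung cancer category) pair."""
    return Counter(zip(statuses, outcomes))


def proportions_within_status(counts):
    """Convert the counts into proportions within each status group."""
    totals = Counter()
    for (status, _outcome), n in counts.items():
        totals[status] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}


# Invented baseline data for a single indicator (illustration only)
baseline_status = ["Present", "Present", "Absent", "Present", "Absent", "Absent"]
baseline_outcome = ["Yes", "Yes", "No", "No", "Yes", "No"]

counts = count_by_status(baseline_status, baseline_outcome)
proportions = proportions_within_status(counts)

# The user's selection determines which group of bars is marked on the plot
test_case_status = "Present"
highlighted = {
    outcome: counts[(test_case_status, outcome)] for outcome in ("Yes", "No")
}

print(counts)
print(highlighted)
```

The `highlighted` dictionary holds the bars that correspond to the test case entry, which is what lets the user compare their input against the study population.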

2. Summary ¶

ModelDeployment1_Summary_0.png

ModelDeployment1_Summary_1.png

ModelDeployment1_Summary_2.png

ModelDeployment1_Summary_3.png
