Model Deployment: Detecting and Analyzing Machine Learning Model Drift Using Open-Source Monitoring Tools¶


John Pauline Pineda

October 15, 2025


  • 1. Table of Contents
    • 1.1 Data Background
    • 1.2 Data Description
    • 1.3 Data Quality Assessment
    • 1.4 Data Preprocessing
      • 1.4.1 Data Splitting
      • 1.4.2 Outlier and Distributional Shape Analysis
      • 1.4.3 Collinearity
    • 1.5 Data Exploration
      • 1.5.1 Exploratory Data Analysis
      • 1.5.2 Hypothesis Testing
    • 1.6 Premodelling Data Preparation
      • 1.6.1 Preprocessed Data Description
      • 1.6.2 Preprocessing Pipeline Development
    • 1.7 Model Development and Validation
      • 1.7.1 Random Forest
      • 1.7.2 AdaBoost
      • 1.7.3 Gradient Boosting
      • 1.7.4 XGBoost
      • 1.7.5 LightGBM
      • 1.7.6 CatBoost
    • 1.8 Model Monitoring using the NannyML Framework
      • 1.8.1 Baseline Control
      • 1.8.2 Simulated Covariate Drift
      • 1.8.3 Simulated Prior Shift
      • 1.8.4 Simulated Concept Drift
      • 1.8.5 Simulated Missingness Spike
      • 1.8.6 Simulated Seasonal Pattern
    • 1.9 Consolidated Findings
  • 2. Summary
  • 3. References

1. Table of Contents ¶

This project investigates open-source frameworks for post-deployment model monitoring and performance estimation, with a particular focus on NannyML for detecting and interpreting shifts in machine learning pipelines using Python. The objective was to systematically analyze how different types of drift and distribution changes manifest after model deployment, and to demonstrate how robust monitoring mitigates risks of performance degradation and biased decision-making. The workflow began with the development and selection of a baseline predictive model, which served as a reference for stability. The dataset was then deliberately perturbed to simulate a range of realistic post-deployment scenarios: Covariate Drift (shifts in feature distributions), Prior Shift (changes in target label proportions), Concept Drift (evolving relationships between features and outcomes), Missingness Spikes (abrupt increases in absent data), and Seasonal Patterns (periodic variations in distributions). NannyML’s statistical tests, visualization capabilities, and performance estimation methods were subsequently applied to diagnose these shifts, evaluate their potential impact, and provide interpretable insights into model reliability. By contrasting baseline and perturbed conditions, the experiment demonstrated how continuous monitoring augments traditional offline evaluation, offering a safeguard against hidden risks. The findings highlighted how tools like NannyML can integrate seamlessly into MLOps workflows to enable proactive governance, early warning systems, and sustainable deployment practices. All results were consolidated in a Summary presented at the end of the document.

Post-Deployment Monitoring refers to the continuous oversight of machine learning models once they are integrated into production systems. Unlike offline evaluation, which relies on static validation datasets, monitoring addresses the challenges of evolving real-world data streams where underlying distributions may shift. Effective monitoring ensures that models remain accurate, unbiased, and aligned with business objectives. In MLOps, monitoring encompasses data integrity checks, drift detection, performance estimation, and alerting mechanisms. NannyML operationalizes this concept by focusing on performance estimation without ground truth, and by offering statistical methods to detect when data or predictions deviate from expected baselines. The challenges of post-deployment monitoring include delayed or missing ground truth labels, non-stationary data, hidden feedback loops, and difficulties distinguishing natural fluctuations from problematic drifts. Common solutions involve deploying drift detection algorithms, conducting regular audits of data pipelines, simulating counterfactuals, and retraining models on updated data. Monitoring frameworks must balance sensitivity (detecting real problems quickly) with robustness (avoiding false alarms caused by natural noise). Another key challenge is explainability: stakeholders need interpretable signals that justify interventions such as retraining or rolling back models. Tools like NannyML address these challenges through statistical tests for data drift, performance estimation without labels, missingness tracking, and visual diagnostics, making monitoring actionable for data scientists and business teams alike.

Covariate Shift occurs when the distribution of input features changes over time compared to the data used to train the model. Also known as data drift, it does not necessarily imply that the model’s predictive mapping is invalid, but it often precedes performance degradation. Detecting covariate drift requires comparing feature distributions between baseline (reference) data and incoming production data. NannyML provides multiple statistical tests and visualization tools to flag significant changes. Key signatures of covariate shift include shifts in summary statistics (mean, variance), changes in distributional shape, or increased divergence between reference and production feature distributions. These shifts may lead to poor generalization, as the model has not been exposed to the altered feature ranges. Detection techniques include univariate statistical tests (e.g., Kolmogorov–Smirnov, Chi-square), multivariate distance measures (e.g., Jensen–Shannon divergence, Population Stability Index), and density estimation methods. Remediation approaches involve domain adaptation, re-weighting training samples, or retraining models on updated data distributions. NannyML implements univariate and multivariate tests, provides drift magnitude quantification, and visualizes feature-level changes, allowing practitioners to pinpoint which features are most responsible for the detected drift.
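
To make the detection techniques above concrete, the following minimal sketch compares a reference feature sample against a production sample using the Kolmogorov–Smirnov test from scipy and a hand-rolled Population Stability Index. The simulated data, bin count, and alert thresholds are illustrative assumptions, not part of the original analysis.

##################################
# Illustrative covariate drift check
# using the KS test and PSI
# (simulated data, assumed thresholds)
##################################
import numpy as np
from scipy.stats import ks_2samp

def population_stability_index(reference, production, bins=10):
    # Bin edges are derived from the reference distribution;
    # production values outside this range are ignored by np.histogram
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    prod_counts, _ = np.histogram(production, bins=edges)
    # Convert counts to proportions; the small constant avoids log(0)
    ref_prop = (ref_counts + 1e-6) / (ref_counts + 1e-6).sum()
    prod_prop = (prod_counts + 1e-6) / (prod_counts + 1e-6).sum()
    return np.sum((prod_prop - ref_prop) * np.log(prod_prop / ref_prop))

rng = np.random.default_rng(42)
reference = rng.normal(loc=14.0, scale=3.5, size=1000)   # e.g., radius_mean at training time
production = rng.normal(loc=16.0, scale=3.5, size=1000)  # mean-shifted production stream

ks_statistic, p_value = ks_2samp(reference, production)
psi = population_stability_index(reference, production)
print(f"KS statistic: {ks_statistic:.3f} (p-value: {p_value:.4f})")
print(f"PSI: {psi:.3f}")
# Common rule of thumb: PSI < 0.10 stable, 0.10-0.25 moderate, > 0.25 major shift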

Prior Shift arises when the distribution of the target variable changes, while the conditional relationship between features and labels remains stable. This is also referred to as label shift. Models trained on the original distribution may underperform because their predictions no longer match the new class priors. Detecting prior shifts is crucial, especially in imbalanced classification tasks where small changes in priors can lead to large performance impacts. Prior shift is typically characterized by systematic increases or decreases in class frequencies without corresponding changes in feature distributions. Its impact includes skewed decision thresholds, inflated false positives/negatives, and degraded calibration of predicted probabilities. Detection approaches include monitoring predicted class proportions, estimating priors using EM-based algorithms, and re-weighting predictions to align with new distributions. Correction strategies may involve resampling, threshold adjustment, or cost-sensitive learning. NannyML assists by tracking predicted probability distributions and comparing them against reference priors, using techniques such as KL divergence and PSI to quantify the magnitude of shift.
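
As a rough sketch of the monitoring idea described above, the snippet below compares predicted class proportions from a production batch against the reference priors using KL divergence, then derives simple prior-ratio re-weighting factors. The production proportions are hypothetical values chosen for illustration.

##################################
# Illustrative prior shift check
# via KL divergence on class priors
# (hypothetical production proportions)
##################################
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    # KL(P || Q) between two discrete distributions
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Reference class priors from the training data (B vs M)
reference_priors = np.array([0.627, 0.373])

# Hypothetical predicted-class proportions from a production batch
# where malignant predictions have become more frequent
production_priors = np.array([0.50, 0.50])

drift_score = kl_divergence(production_priors, reference_priors)
print(f"KL divergence from reference priors: {drift_score:.4f}")

# A simple correction: re-weight predictions by the prior ratio
prior_ratio = production_priors / reference_priors
print(f"Class re-weighting factors: {prior_ratio.round(3)}")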

Concept Drift occurs when the underlying relationship between input features and target labels evolves over time. Unlike covariate shift, where features change independently, concept drift implies that the model’s mapping function itself becomes outdated. Concept drift is among the most damaging forms of drift because it directly undermines predictive accuracy. Detecting it often requires monitoring model outputs or inferred performance over time. NannyML addresses this by estimating performance even when ground truth labels are unavailable. Concept drift is typically signaled by a gradual or sudden decline in performance metrics, inconsistent error patterns, or misalignment between expected and actual prediction behavior. Its impact is severe: models may lose predictive power entirely if they cannot adapt. Detection methods include window-based performance monitoring, hypothesis testing, adaptive ensembles, and statistical monitoring of residuals. Corrective actions include periodic retraining, incremental learning, and online adaptation strategies. NannyML leverages Confidence-Based Performance Estimation (CBPE) and other statistical techniques to estimate performance degradation without labels, making it possible to detect concept drift in real-time production environments.
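
The core intuition behind confidence-based estimation can be sketched without the NannyML library: if the model's probabilities are well calibrated, each scored observation contributes a fractional expected true or false outcome, so an expected confusion matrix (and hence expected precision, recall, and accuracy) can be computed from the scores alone. The function and score distributions below are illustrative assumptions, not NannyML's implementation.

##################################
# Sketch of the CBPE intuition:
# expected metrics from calibrated
# scores alone (simulated scores)
##################################
import numpy as np

def estimate_expected_metrics(y_proba, threshold=0.5):
    # If p is the calibrated probability of the positive class, each
    # prediction contributes p (or 1 - p) of an expected true/false
    # outcome without needing the actual label
    y_proba = np.asarray(y_proba, dtype=float)
    y_pred = (y_proba >= threshold).astype(int)
    exp_tp = np.sum(y_proba[y_pred == 1])       # expected true positives
    exp_fp = np.sum(1 - y_proba[y_pred == 1])   # expected false positives
    exp_fn = np.sum(y_proba[y_pred == 0])       # expected false negatives
    exp_tn = np.sum(1 - y_proba[y_pred == 0])   # expected true negatives
    precision = exp_tp / max(exp_tp + exp_fp, 1e-9)
    recall = exp_tp / max(exp_tp + exp_fn, 1e-9)
    accuracy = (exp_tp + exp_tn) / len(y_proba)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}

rng = np.random.default_rng(0)
# Confident, well-separated scores (healthy model) versus
# scores drifting toward 0.5 (model losing discriminative power)
healthy_scores = np.clip(rng.beta(0.5, 0.5, size=500), 0.01, 0.99)
drifted_scores = np.clip(rng.beta(4, 4, size=500), 0.01, 0.99)
print("Healthy:", estimate_expected_metrics(healthy_scores))
print("Drifted:", estimate_expected_metrics(drifted_scores))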

Missingness Spike refers to sudden increases in missing values within production data. Missing features can destabilize preprocessing pipelines, distort predictions, and signal upstream data collection failures. Monitoring missingness is critical for ensuring both model reliability and data pipeline health. NannyML provides built-in mechanisms to track and visualize changes in missing data patterns, alerting stakeholders before downstream impacts occur. Key indicators of missingness spikes include abrupt rises in null counts, missing categorical levels, or structural breaks in feature completeness. The consequences range from biased predictions to outright system failures if preprocessing pipelines cannot handle unexpected missingness. Detection methods include statistical monitoring of missing value proportions, anomaly detection on completeness metrics, and threshold-based alerts. Solutions typically involve robust imputation, pipeline hardening, and upstream data validation. NannyML offers automated missingness detection, completeness trend visualization, and configurable thresholds, ensuring that missingness issues are surfaced early.
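
A minimal version of threshold-based missingness alerting, assuming hypothetical column names and an illustrative 5% tolerance, can be expressed with pandas alone:

##################################
# Illustrative missingness spike alert
# (hypothetical columns and tolerance)
##################################
import numpy as np
import pandas as pd

def missingness_alerts(reference_df, production_df, tolerance=0.05):
    # Flag columns whose production null rate exceeds the
    # reference null rate by more than `tolerance`
    ref_rates = reference_df.isna().mean()
    prod_rates = production_df.isna().mean()
    delta = prod_rates - ref_rates
    return delta[delta > tolerance].sort_values(ascending=False)

rng = np.random.default_rng(7)
reference = pd.DataFrame({"radius_mean": rng.normal(14, 3.5, 500),
                          "texture_mean": rng.normal(19, 4.3, 500)})
production = reference.copy()
# Simulate an upstream failure: 20% of texture_mean values go missing
mask = rng.random(len(production)) < 0.20
production.loc[mask, "texture_mean"] = np.nan

print(missingness_alerts(reference, production))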

Seasonal Pattern Shift represents periodic fluctuations in data distributions or outcomes that follow predictable cycles. If models are not trained with sufficient historical data to capture these patterns, their predictions may systematically underperform during certain periods. NannyML’s monitoring can reveal recurring deviations, helping teams distinguish between natural seasonality and genuine drift that requires retraining. Seasonality is often characterized by cyclic patterns in data features, prediction distributions, or performance metrics. Its impact includes systematic biases, recurring error peaks, and difficulty distinguishing drift from natural variability. Detection techniques include autocorrelation analysis, Fourier decomposition, and seasonal-trend decomposition. Mitigation strategies involve training with longer historical datasets, adding time-related features, or developing seasonally adaptive models. NannyML highlights recurring deviations in drift metrics, making it easier for practitioners to separate cyclical behavior from true degradation, ensuring that alerts are contextually relevant.
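
The autocorrelation check mentioned above can be sketched in a few lines: a drift metric that peaks at a fixed lag (here an assumed weekly cycle in simulated data) points to seasonality rather than genuine degradation.

##################################
# Illustrative seasonality check via
# sample autocorrelation (simulated
# weekly cycle in a drift metric)
##################################
import numpy as np

def autocorrelation(series, lag):
    # Sample autocorrelation of the series at a given lag
    series = np.asarray(series, dtype=float)
    centered = series - series.mean()
    return float(np.sum(centered[lag:] * centered[:-lag]) /
                 np.sum(centered ** 2))

rng = np.random.default_rng(21)
# Simulated daily drift metric with a period-7 cycle plus noise
days = np.arange(364)
metric = 0.1 + 0.05 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 0.01, days.size)

# A pronounced peak at lag 7 (and 14) suggests weekly seasonality
for lag in (1, 3, 7, 14):
    print(f"lag {lag:>2}: autocorrelation = {autocorrelation(metric, lag):+.3f}")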

Performance Estimation Without Labels refers to scenarios in real-world deployments where the ground truth often arrives with delay, or may never become available. This makes direct performance tracking difficult. NannyML addresses this challenge by providing algorithms to estimate model performance without labels using confidence distributions, statistical inference, and robust estimation techniques. This capability allows practitioners to maintain visibility into model health continuously, even in label-scarce settings, bridging a critical gap in MLOps monitoring practices. Algorithms in this domain include Confidence-Based Performance Estimation (CBPE), which infers performance by comparing predicted probability distributions against expected confidence intervals, and Direct Loss Estimation (DLE), which approximates error rates based on calibration. Statistical inference techniques allow practitioners to construct confidence bounds around estimated metrics, while robust estimation mitigates the risk of spurious signals caused by small sample sizes or noisy predictions. NannyML provides implementations of CBPE and DLE, supporting metrics such as precision, recall, F1-score, and AUROC, all estimated without labels. This makes it possible to detect when a model is underperforming even before labels are collected, reducing blind spots in production monitoring.
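
A hedged sketch of how CBPE is typically wired up with NannyML is shown below, using small synthetic dataframes. The column names (predicted_probability, predicted_label), data-generating function, and chunk size are placeholders, and the exact constructor arguments may differ across NannyML versions, so the official documentation should be treated as authoritative.

##################################
# Sketch of NannyML CBPE usage on
# synthetic scored data (placeholder
# column names; verify arguments
# against your NannyML version)
##################################
import numpy as np
import pandas as pd
import nannyml as nml

rng = np.random.default_rng(0)

def make_scored_frame(n, shift=0.0):
    # Synthetic model scores with labels sampled consistently with the scores
    proba = np.clip(rng.beta(2, 3, size=n) + shift, 0.001, 0.999)
    return pd.DataFrame({
        'predicted_probability': proba,
        'predicted_label': (proba >= 0.5).astype(int),
        'diagnosis': (rng.random(n) < proba).astype(int),
    })

reference_df = make_scored_frame(500)             # labeled reference period
analysis_df = make_scored_frame(500, shift=0.15)  # production period (labels unused)

estimator = nml.CBPE(
    y_pred_proba='predicted_probability',  # column holding positive-class scores
    y_pred='predicted_label',              # column holding predicted classes
    y_true='diagnosis',                    # target column (known in reference)
    problem_type='classification_binary',
    metrics=['roc_auc', 'f1'],
    chunk_size=100                         # observations per monitoring chunk
)
estimator.fit(reference_df)                  # learn reference score behavior
estimated = estimator.estimate(analysis_df)  # label-free performance estimates
print(estimated.to_df().head())              # per-chunk estimates with alert flags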

Performance Estimation With Labels refers to the direct evaluation of model predictions against actual ground truth outcomes once labels are available. Unlike label-free methods, this approach allows for precise calculation of traditional performance metrics such as accuracy, precision, recall, F1-score, AUROC, and calibration error. Monitoring with labels provides the most reliable indication of model performance, enabling fine-grained diagnosis of errors and biases. The advantage of having labels is the ability to attribute errors to specific subgroups, detect fairness violations, and conduct targeted retraining. Challenges include label delay, annotation quality, and ensuring that labels accurately reflect the operational environment. Common approaches include sliding window evaluation, where performance is tracked over recent data batches, and benchmark comparison, where production metrics are compared to baseline test set results. NannyML incorporates labeled performance tracking alongside its label-free estimators, allowing users to validate estimates once ground truth becomes available. This dual capability ensures consistency, improves confidence in label-free methods, and provides a comprehensive framework for performance monitoring in both short-term and long-term horizons.
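
Once labels arrive, sliding-window evaluation is straightforward; the sketch below tracks F1 over recent batches so that a regression partway through the stream becomes visible. The window and step sizes, and the simulated degradation, are illustrative assumptions.

##################################
# Illustrative sliding-window labeled
# evaluation (simulated degradation)
##################################
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

def rolling_f1(y_true, y_pred, window=100, step=50):
    # Evaluate F1 over sliding windows of recent observations
    # so that performance regressions surface batch by batch
    scores = []
    for start in range(0, len(y_true) - window + 1, step):
        stop = start + window
        scores.append({
            "window_start": start,
            "f1": f1_score(y_true[start:stop], y_pred[start:stop], pos_label=1),
        })
    return pd.DataFrame(scores)

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, size=600)
# A model that degrades halfway through the stream: error rate 5% -> 25%
flip = rng.random(600) < np.where(np.arange(600) < 300, 0.05, 0.25)
y_pred = np.where(flip, 1 - y_true, y_true)

print(rolling_f1(y_true, y_pred))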

1.1. Data Background ¶

An open Breast Cancer Dataset from Kaggle (with all credits attributed to Wasiq Ali) was used for the analysis as consolidated from the following primary sources:

  1. Reference Repository entitled Differentiated Breast Cancer Recurrence from the UC Irvine Machine Learning Repository
  2. Research Paper entitled Nuclear Feature Extraction for Breast Tumor Diagnosis from the Electronic Imaging proceedings

This study hypothesized that the cell nuclei features derived from digitized images of fine needle aspirates (FNA) of breast masses influence breast cancer diagnoses among patients.

The dichotomous categorical variable for the study is:

  • diagnosis - Status of the patient (M, Medical diagnosis of a cancerous breast tumor | B, Medical diagnosis of a non-cancerous breast tumor)

The predictor variables for the study are:

  • radius_mean - Mean of the radius measurements (Mean of distances from center to points on the perimeter)
  • texture_mean - Mean of the texture measurements (Standard deviation of grayscale values)
  • perimeter_mean - Mean of the perimeter measurements
  • area_mean - Mean of the area measurements
  • smoothness_mean - Mean of the smoothness measurements (Local variation in radius lengths)
  • compactness_mean - Mean of the compactness measurements (Perimeter² / area - 1.0)
  • concavity_mean - Mean of the concavity measurements (Severity of concave portions of the contour)
  • concave points_mean - Mean of the concave points measurements (Number of concave portions of the contour)
  • symmetry_mean - Mean of the symmetry measurements
  • fractal_dimension_mean - Mean of the fractal dimension measurements (Coastline approximation - 1)
  • radius_se - Standard error of the radius measurements (Standard error of distances from center to points on the perimeter)
  • texture_se - Standard error of the texture measurements (Standard deviation of grayscale values)
  • perimeter_se - Standard error of the perimeter measurements
  • area_se - Standard error of the area measurements
  • smoothness_se - Standard error of the smoothness measurements (Local variation in radius lengths)
  • compactness_se - Standard error of the compactness measurements (Perimeter² / area - 1.0)
  • concavity_se - Standard error of the concavity measurements (Severity of concave portions of the contour)
  • concave points_se - Standard error of the concave points measurements (Number of concave portions of the contour)
  • symmetry_se - Standard error of the symmetry measurements
  • fractal_dimension_se - Standard error of the fractal dimension measurements (Coastline approximation - 1)
  • radius_worst - Largest value of the radius measurements (Largest value of distances from center to points on the perimeter)
  • texture_worst - Largest value of the texture measurements (Standard deviation of grayscale values)
  • perimeter_worst - Largest value of the perimeter measurements
  • area_worst - Largest value of the area measurements
  • smoothness_worst - Largest value of the smoothness measurements (Local variation in radius lengths)
  • compactness_worst - Largest value of the compactness measurements (Perimeter² / area - 1.0)
  • concavity_worst - Largest value of the concavity measurements (Severity of concave portions of the contour)
  • concave points_worst - Largest value of the concave points measurements (Number of concave portions of the contour)
  • symmetry_worst - Largest value of the symmetry measurements
  • fractal_dimension_worst - Largest value of the fractal dimension measurements (Coastline approximation - 1)

1.2. Data Description ¶

  1. The initial tabular dataset comprised 569 observations and 32 variables (including 1 metadata, 1 target, and 30 predictors).
    • 569 rows (observations)
    • 32 columns (variables)
      • 1/32 metadata (categorical)
        • id
      • 1/32 target (categorical)
        • diagnosis
      • 30/32 predictor (numeric)
        • radius_mean
        • texture_mean
        • perimeter_mean
        • area_mean
        • smoothness_mean
        • compactness_mean
        • concavity_mean
        • concave points_mean
        • symmetry_mean
        • fractal_dimension_mean
        • radius_se
        • texture_se
        • perimeter_se
        • area_se
        • smoothness_se
        • compactness_se
        • concavity_se
        • concave points_se
        • symmetry_se
        • fractal_dimension_se
        • radius_worst
        • texture_worst
        • perimeter_worst
        • area_worst
        • smoothness_worst
        • compactness_worst
        • concavity_worst
        • concave points_worst
        • symmetry_worst
        • fractal_dimension_worst
  2. The id variable was transformed to a row index for the data observations.
In [141]:
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
import re
import pickle
%matplotlib inline

import hashlib
import json
from urllib.parse import urlparse
import logging

from operator import truediv
from sklearn.preprocessing import OrdinalEncoder
from scipy import stats
from scipy.stats import pointbiserialr, chi2_contingency

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, ParameterGrid, StratifiedShuffleSplit
from sklearn.base import clone
In [142]:
##################################
# Defining file paths
##################################
DATASETS_ORIGINAL_PATH = r"datasets\original"
DATASETS_FINAL_PATH = r"datasets\final\complete"
DATASETS_FINAL_TRAIN_PATH = r"datasets\final\train"
DATASETS_FINAL_TRAIN_FEATURES_PATH = r"datasets\final\train\features"
DATASETS_FINAL_TRAIN_TARGET_PATH = r"datasets\final\train\target"
DATASETS_FINAL_VALIDATION_PATH = r"datasets\final\validation"
DATASETS_FINAL_VALIDATION_FEATURES_PATH = r"datasets\final\validation\features"
DATASETS_FINAL_VALIDATION_TARGET_PATH = r"datasets\final\validation\target"
DATASETS_FINAL_TEST_PATH = r"datasets\final\test"
DATASETS_FINAL_TEST_FEATURES_PATH = r"datasets\final\test\features"
DATASETS_FINAL_TEST_TARGET_PATH = r"datasets\final\test\target"
DATASETS_PREPROCESSED_PATH = r"datasets\preprocessed"
DATASETS_PREPROCESSED_TRAIN_PATH = r"datasets\preprocessed\train"
DATASETS_PREPROCESSED_TRAIN_FEATURES_PATH = r"datasets\preprocessed\train\features"
DATASETS_PREPROCESSED_TRAIN_TARGET_PATH = r"datasets\preprocessed\train\target"
DATASETS_PREPROCESSED_VALIDATION_PATH = r"datasets\preprocessed\validation"
DATASETS_PREPROCESSED_VALIDATION_FEATURES_PATH = r"datasets\preprocessed\validation\features"
DATASETS_PREPROCESSED_VALIDATION_TARGET_PATH = r"datasets\preprocessed\validation\target"
DATASETS_PREPROCESSED_TEST_PATH = r"datasets\preprocessed\test"
DATASETS_PREPROCESSED_TEST_FEATURES_PATH = r"datasets\preprocessed\test\features"
DATASETS_PREPROCESSED_TEST_TARGET_PATH = r"datasets\preprocessed\test\target"
In [143]:
##################################
# Loading the dataset
# from the DATASETS_ORIGINAL_PATH
##################################
breast_cancer = pd.read_csv(os.path.join("..", DATASETS_ORIGINAL_PATH, "Breast_Cancer_Dataset.csv"))
In [144]:
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(breast_cancer.shape)
Dataset Dimensions: 
(569, 32)
In [145]:
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(breast_cancer.dtypes)
Column Names and Data Types:
id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
dtype: object
In [146]:
##################################
# Setting the ID column as row names
##################################
breast_cancer = breast_cancer.set_index("id")
In [147]:
##################################
# Taking a snapshot of the dataset
##################################
breast_cancer.head()
Out[147]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
id
842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 31 columns

In [148]:
##################################
# Performing a general exploration of the numeric variables
##################################
print('Numeric Variable Summary:')
display(breast_cancer.describe(include='number').transpose())
Numeric Variable Summary:
count mean std min 25% 50% 75% max
radius_mean 569.0 14.127292 3.524049 6.981000 11.700000 13.370000 15.780000 28.11000
texture_mean 569.0 19.289649 4.301036 9.710000 16.170000 18.840000 21.800000 39.28000
perimeter_mean 569.0 91.969033 24.298981 43.790000 75.170000 86.240000 104.100000 188.50000
area_mean 569.0 654.889104 351.914129 143.500000 420.300000 551.100000 782.700000 2501.00000
smoothness_mean 569.0 0.096360 0.014064 0.052630 0.086370 0.095870 0.105300 0.16340
compactness_mean 569.0 0.104341 0.052813 0.019380 0.064920 0.092630 0.130400 0.34540
concavity_mean 569.0 0.088799 0.079720 0.000000 0.029560 0.061540 0.130700 0.42680
concave points_mean 569.0 0.048919 0.038803 0.000000 0.020310 0.033500 0.074000 0.20120
symmetry_mean 569.0 0.181162 0.027414 0.106000 0.161900 0.179200 0.195700 0.30400
fractal_dimension_mean 569.0 0.062798 0.007060 0.049960 0.057700 0.061540 0.066120 0.09744
radius_se 569.0 0.405172 0.277313 0.111500 0.232400 0.324200 0.478900 2.87300
texture_se 569.0 1.216853 0.551648 0.360200 0.833900 1.108000 1.474000 4.88500
perimeter_se 569.0 2.866059 2.021855 0.757000 1.606000 2.287000 3.357000 21.98000
area_se 569.0 40.337079 45.491006 6.802000 17.850000 24.530000 45.190000 542.20000
smoothness_se 569.0 0.007041 0.003003 0.001713 0.005169 0.006380 0.008146 0.03113
compactness_se 569.0 0.025478 0.017908 0.002252 0.013080 0.020450 0.032450 0.13540
concavity_se 569.0 0.031894 0.030186 0.000000 0.015090 0.025890 0.042050 0.39600
concave points_se 569.0 0.011796 0.006170 0.000000 0.007638 0.010930 0.014710 0.05279
symmetry_se 569.0 0.020542 0.008266 0.007882 0.015160 0.018730 0.023480 0.07895
fractal_dimension_se 569.0 0.003795 0.002646 0.000895 0.002248 0.003187 0.004558 0.02984
radius_worst 569.0 16.269190 4.833242 7.930000 13.010000 14.970000 18.790000 36.04000
texture_worst 569.0 25.677223 6.146258 12.020000 21.080000 25.410000 29.720000 49.54000
perimeter_worst 569.0 107.261213 33.602542 50.410000 84.110000 97.660000 125.400000 251.20000
area_worst 569.0 880.583128 569.356993 185.200000 515.300000 686.500000 1084.000000 4254.00000
smoothness_worst 569.0 0.132369 0.022832 0.071170 0.116600 0.131300 0.146000 0.22260
compactness_worst 569.0 0.254265 0.157336 0.027290 0.147200 0.211900 0.339100 1.05800
concavity_worst 569.0 0.272188 0.208624 0.000000 0.114500 0.226700 0.382900 1.25200
concave points_worst 569.0 0.114606 0.065732 0.000000 0.064930 0.099930 0.161400 0.29100
symmetry_worst 569.0 0.290076 0.061867 0.156500 0.250400 0.282200 0.317900 0.66380
fractal_dimension_worst 569.0 0.083946 0.018061 0.055040 0.071460 0.080040 0.092080 0.20750

1.3. Data Quality Assessment ¶

Findings from the data quality assessment are as follows:

  1. No duplicated rows were noted.
  2. No missing data noted, with no variable showing Null.Count>0 or Fill.Rate<1.0.
  3. No low variance observed, with no numeric variable showing First.Second.Mode.Ratio>10 and no categorical variable showing First.Second.Mode.Ratio>5.
  4. No low variance observed for any variable with Unique.Count.Ratio>10.
  5. High skewness observed for 5 variables with Skewness>3 or Skewness<(-3).
    • area_se: Skewness = 5.447
    • concavity_se: Skewness = 5.110
    • fractal_dimension_se: Skewness = 3.923
    • perimeter_se: Skewness = 3.443
    • radius_se: Skewness = 3.088
In [149]:
##################################
# Counting the number of duplicated rows
##################################
breast_cancer.duplicated().sum()
Out[149]:
np.int64(0)
In [150]:
##################################
# Gathering the data types for each column
##################################
data_type_list = list(breast_cancer.dtypes)
In [151]:
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(breast_cancer.columns)
In [152]:
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(breast_cancer)] * len(breast_cancer.columns))
In [153]:
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(breast_cancer.isna().sum(axis=0))
In [154]:
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(breast_cancer.count())
In [155]:
##################################
# Gathering the fill rate for each column
##################################
fill_rate_list = map(truediv, non_null_count_list, row_count_list)
In [156]:
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
                                              data_type_list,
                                              row_count_list,
                                              non_null_count_list,
                                              null_count_list,
                                              fill_rate_list), 
                                        columns=['Column.Name',
                                                 'Column.Type',
                                                 'Row.Count',
                                                 'Non.Null.Count',
                                                 'Null.Count',                                                 
                                                 'Fill.Rate'])
display(all_column_quality_summary)
Column.Name Column.Type Row.Count Non.Null.Count Null.Count Fill.Rate
0 diagnosis object 569 569 0 1.0
1 radius_mean float64 569 569 0 1.0
2 texture_mean float64 569 569 0 1.0
3 perimeter_mean float64 569 569 0 1.0
4 area_mean float64 569 569 0 1.0
5 smoothness_mean float64 569 569 0 1.0
6 compactness_mean float64 569 569 0 1.0
7 concavity_mean float64 569 569 0 1.0
8 concave points_mean float64 569 569 0 1.0
9 symmetry_mean float64 569 569 0 1.0
10 fractal_dimension_mean float64 569 569 0 1.0
11 radius_se float64 569 569 0 1.0
12 texture_se float64 569 569 0 1.0
13 perimeter_se float64 569 569 0 1.0
14 area_se float64 569 569 0 1.0
15 smoothness_se float64 569 569 0 1.0
16 compactness_se float64 569 569 0 1.0
17 concavity_se float64 569 569 0 1.0
18 concave points_se float64 569 569 0 1.0
19 symmetry_se float64 569 569 0 1.0
20 fractal_dimension_se float64 569 569 0 1.0
21 radius_worst float64 569 569 0 1.0
22 texture_worst float64 569 569 0 1.0
23 perimeter_worst float64 569 569 0 1.0
24 area_worst float64 569 569 0 1.0
25 smoothness_worst float64 569 569 0 1.0
26 compactness_worst float64 569 569 0 1.0
27 concavity_worst float64 569 569 0 1.0
28 concave points_worst float64 569 569 0 1.0
29 symmetry_worst float64 569 569 0 1.0
30 fractal_dimension_worst float64 569 569 0 1.0
In [157]:
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])
Out[157]:
0
In [158]:
##################################
# Identifying the columns
# with Fill.Rate < 0.90
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<0.90)]
In [159]:
##################################
# Gathering the indices for each observation
##################################
row_index_list = breast_cancer.index
In [160]:
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(breast_cancer.columns)] * len(breast_cancer))
In [161]:
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(breast_cancer.isna().sum(axis=1))
In [162]:
##################################
# Gathering the missing data percentage for each row
##################################
missing_rate_list = map(truediv, null_row_list, column_count_list)
In [163]:
##################################
# Identifying the rows
# with missing data
##################################
all_row_quality_summary = pd.DataFrame(zip(row_index_list,
                                           column_count_list,
                                           null_row_list,
                                           missing_rate_list), 
                                        columns=['Row.Name',
                                                 'Column.Count',
                                                 'Null.Count',                                                 
                                                 'Missing.Rate'])
display(all_row_quality_summary)
Row.Name Column.Count Null.Count Missing.Rate
0 842302 31 0 0.0
1 842517 31 0 0.0
2 84300903 31 0 0.0
3 84348301 31 0 0.0
4 84358402 31 0 0.0
... ... ... ... ...
564 926424 31 0 0.0
565 926682 31 0 0.0
566 926954 31 0 0.0
567 927241 31 0 0.0
568 92751 31 0 0.0

569 rows × 4 columns

In [164]:
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])
Out[164]:
0
In [165]:
##################################
# Formulating the dataset
# with numeric columns only
##################################
breast_cancer_numeric = breast_cancer.select_dtypes(include='number')
In [166]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = breast_cancer_numeric.columns
In [167]:
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = breast_cancer_numeric.min()
In [168]:
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = breast_cancer_numeric.mean()
In [169]:
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = breast_cancer_numeric.median()
In [170]:
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = breast_cancer_numeric.max()
In [171]:
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [breast_cancer[x].value_counts(dropna=True).index.tolist()[0] for x in breast_cancer_numeric]
In [172]:
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [breast_cancer[x].value_counts(dropna=True).index.tolist()[1] for x in breast_cancer_numeric]
In [173]:
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [breast_cancer_numeric[x].isin([breast_cancer[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in breast_cancer_numeric]
In [174]:
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [breast_cancer_numeric[x].isin([breast_cancer[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in breast_cancer_numeric]
In [175]:
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
In [176]:
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = breast_cancer_numeric.nunique(dropna=True)
In [177]:
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(breast_cancer_numeric)] * len(breast_cancer_numeric.columns))
In [178]:
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
In [179]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = breast_cancer_numeric.skew()
In [180]:
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = breast_cancer_numeric.kurtosis()
In [181]:
##################################
# Generating a column quality summary for the numeric column
##################################
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
                                                numeric_minimum_list,
                                                numeric_mean_list,
                                                numeric_median_list,
                                                numeric_maximum_list,
                                                numeric_first_mode_list,
                                                numeric_second_mode_list,
                                                numeric_first_mode_count_list,
                                                numeric_second_mode_count_list,
                                                numeric_first_second_mode_ratio_list,
                                                numeric_unique_count_list,
                                                numeric_row_count_list,
                                                numeric_unique_count_ratio_list,
                                                numeric_skewness_list,
                                                numeric_kurtosis_list), 
                                        columns=['Numeric.Column.Name',
                                                 'Minimum',
                                                 'Mean',
                                                 'Median',
                                                 'Maximum',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio',
                                                 'Skewness',
                                                 'Kurtosis'])
display(numeric_column_quality_summary)
Numeric.Column.Name Minimum Mean Median Maximum First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio Skewness Kurtosis
0 radius_mean 6.981000 14.127292 13.370000 28.11000 12.340000 11.060000 4 3 1.333333 456 569 0.801406 0.942380 0.845522
1 texture_mean 9.710000 19.289649 18.840000 39.28000 16.840000 19.830000 3 3 1.000000 479 569 0.841828 0.650450 0.758319
2 perimeter_mean 43.790000 91.969033 86.240000 188.50000 82.610000 134.700000 3 3 1.000000 522 569 0.917399 0.990650 0.972214
3 area_mean 143.500000 654.889104 551.100000 2501.00000 512.200000 394.100000 3 2 1.500000 539 569 0.947276 1.645732 3.652303
4 smoothness_mean 0.052630 0.096360 0.095870 0.16340 0.100700 0.105400 5 4 1.250000 474 569 0.833040 0.456324 0.855975
5 compactness_mean 0.019380 0.104341 0.092630 0.34540 0.114700 0.120600 3 3 1.000000 537 569 0.943761 1.190123 1.650130
6 concavity_mean 0.000000 0.088799 0.061540 0.42680 0.000000 0.120400 13 3 4.333333 537 569 0.943761 1.401180 1.998638
7 concave points_mean 0.000000 0.048919 0.033500 0.20120 0.000000 0.028640 13 3 4.333333 542 569 0.952548 1.171180 1.066556
8 symmetry_mean 0.106000 0.181162 0.179200 0.30400 0.176900 0.189300 4 4 1.000000 432 569 0.759227 0.725609 1.287933
9 fractal_dimension_mean 0.049960 0.062798 0.061540 0.09744 0.067820 0.061130 3 3 1.000000 499 569 0.876977 1.304489 3.005892
10 radius_se 0.111500 0.405172 0.324200 2.87300 0.286000 0.220400 3 3 1.000000 540 569 0.949033 3.088612 17.686726
11 texture_se 0.360200 1.216853 1.108000 4.88500 0.856100 1.350000 3 3 1.000000 519 569 0.912127 1.646444 5.349169
12 perimeter_se 0.757000 2.866059 2.287000 21.98000 1.778000 1.143000 4 2 2.000000 533 569 0.936731 3.443615 21.401905
13 area_se 6.802000 40.337079 24.530000 542.20000 16.970000 16.640000 3 3 1.000000 528 569 0.927944 5.447186 49.209077
14 smoothness_se 0.001713 0.007041 0.006380 0.03113 0.005910 0.006064 2 2 1.000000 547 569 0.961336 2.314450 10.469840
15 compactness_se 0.002252 0.025478 0.020450 0.13540 0.018120 0.011040 3 3 1.000000 541 569 0.950791 1.902221 5.106252
16 concavity_se 0.000000 0.031894 0.025890 0.39600 0.000000 0.021850 13 2 6.500000 533 569 0.936731 5.110463 48.861395
17 concave points_se 0.000000 0.011796 0.010930 0.05279 0.000000 0.011670 13 3 4.333333 507 569 0.891037 1.444678 5.126302
18 symmetry_se 0.007882 0.020542 0.018730 0.07895 0.013440 0.020450 4 3 1.333333 498 569 0.875220 2.195133 7.896130
19 fractal_dimension_se 0.000895 0.003795 0.003187 0.02984 0.002256 0.002205 2 2 1.000000 545 569 0.957821 3.923969 26.280847
20 radius_worst 7.930000 16.269190 14.970000 36.04000 12.360000 13.500000 5 4 1.250000 457 569 0.803163 1.103115 0.944090
21 texture_worst 12.020000 25.677223 25.410000 49.54000 17.700000 27.260000 3 3 1.000000 511 569 0.898067 0.498321 0.224302
22 perimeter_worst 50.410000 107.261213 97.660000 251.20000 117.700000 105.900000 3 3 1.000000 514 569 0.903339 1.128164 1.070150
23 area_worst 185.200000 880.583128 686.500000 4254.00000 698.800000 808.900000 2 2 1.000000 544 569 0.956063 1.859373 4.396395
24 smoothness_worst 0.071170 0.132369 0.131300 0.22260 0.140100 0.131200 4 4 1.000000 411 569 0.722320 0.415426 0.517825
25 compactness_worst 0.027290 0.254265 0.211900 1.05800 0.148600 0.341600 3 3 1.000000 529 569 0.929701 1.473555 3.039288
26 concavity_worst 0.000000 0.272188 0.226700 1.25200 0.000000 0.450400 13 3 4.333333 539 569 0.947276 1.150237 1.615253
27 concave points_worst 0.000000 0.114606 0.099930 0.29100 0.000000 0.110500 13 3 4.333333 492 569 0.864675 0.492616 -0.535535
28 symmetry_worst 0.156500 0.290076 0.282200 0.66380 0.236900 0.310900 3 3 1.000000 500 569 0.878735 1.433928 4.444560
29 fractal_dimension_worst 0.055040 0.083946 0.080040 0.20750 0.074270 0.087010 3 2 1.500000 535 569 0.940246 1.662579 5.244611
In [182]:
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>10)])
Out[182]:
0
In [183]:
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
Out[183]:
0
In [184]:
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])
Out[184]:
5
In [185]:
##################################
# Identifying the numerical columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
display(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))].sort_values(by=['Skewness'], ascending=False))
Numeric.Column.Name Minimum Mean Median Maximum First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio Skewness Kurtosis
13 area_se 6.802000 40.337079 24.530000 542.20000 16.970000 16.640000 3 3 1.0 528 569 0.927944 5.447186 49.209077
16 concavity_se 0.000000 0.031894 0.025890 0.39600 0.000000 0.021850 13 2 6.5 533 569 0.936731 5.110463 48.861395
19 fractal_dimension_se 0.000895 0.003795 0.003187 0.02984 0.002256 0.002205 2 2 1.0 545 569 0.957821 3.923969 26.280847
12 perimeter_se 0.757000 2.866059 2.287000 21.98000 1.778000 1.143000 4 2 2.0 533 569 0.936731 3.443615 21.401905
10 radius_se 0.111500 0.405172 0.324200 2.87300 0.286000 0.220400 3 3 1.0 540 569 0.949033 3.088612 17.686726
In [186]:
##################################
# Formulating the dataset
# with categorical columns only
##################################
breast_cancer_categorical = breast_cancer.select_dtypes(include=['category','object'])
In [187]:
##################################
# Gathering the variable names for the categorical column
##################################
categorical_variable_name_list = breast_cancer_categorical.columns
In [188]:
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [breast_cancer[x].value_counts().index.tolist()[0] for x in breast_cancer_categorical]
In [189]:
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [breast_cancer[x].value_counts().index.tolist()[1] for x in breast_cancer_categorical]
In [190]:
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [breast_cancer_categorical[x].isin([breast_cancer[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in breast_cancer_categorical]
In [191]:
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [breast_cancer_categorical[x].isin([breast_cancer[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in breast_cancer_categorical]
In [192]:
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
In [193]:
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = breast_cancer_categorical.nunique(dropna=True)
In [194]:
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(breast_cancer_categorical)] * len(breast_cancer_categorical.columns))
In [195]:
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
In [196]:
##################################
# Generating a column quality summary for the categorical columns
##################################
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
                                                    categorical_first_mode_list,
                                                    categorical_second_mode_list,
                                                    categorical_first_mode_count_list,
                                                    categorical_second_mode_count_list,
                                                    categorical_first_second_mode_ratio_list,
                                                    categorical_unique_count_list,
                                                    categorical_row_count_list,
                                                    categorical_unique_count_ratio_list), 
                                        columns=['Categorical.Column.Name',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio'])
display(categorical_column_quality_summary)
Categorical.Column.Name First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio
0 diagnosis B M 357 212 1.683962 2 569 0.003515
In [197]:
##################################
# Counting the number of categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[197]:
0
In [198]:
##################################
# Counting the number of categorical columns
# with Unique.Count.Ratio > 10.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
Out[198]:
0

1.4. Data Preprocessing ¶

1.4.1 Data Splitting¶

  1. The baseline dataset comprises:
    • 569 rows (observations)
      • 357 diagnosis=B: 62.74%
      • 212 diagnosis=M: 37.26%
    • 31 columns (variables)
      • 1/31 target (categorical)
        • diagnosis
      • 30/31 predictor (numeric)
        • radius_mean
        • texture_mean
        • perimeter_mean
        • area_mean
        • smoothness_mean
        • compactness_mean
        • concavity_mean
        • concave points_mean
        • symmetry_mean
        • fractal_dimension_mean
        • radius_se
        • texture_se
        • perimeter_se
        • area_se
        • smoothness_se
        • compactness_se
        • concavity_se
        • concave points_se
        • symmetry_se
        • fractal_dimension_se
        • radius_worst
        • texture_worst
        • perimeter_worst
        • area_worst
        • smoothness_worst
        • compactness_worst
        • concavity_worst
        • concave points_worst
        • symmetry_worst
        • fractal_dimension_worst
  2. The baseline dataset was divided into three subsets using a fixed random seed:
    • test data: 25% of the original data with class stratification applied
    • train data (initial): 75% of the original data with class stratification applied
      • train data (final): 75% of the train (initial) data with class stratification applied
      • validation data: 25% of the train (initial) data with class stratification applied
  3. Models were developed from the train data (final). From the same dataset, a subset of models with optimal hyperparameters was selected based on cross-validation.
  4. Among candidate models with optimal hyperparameters, the final model was selected based on performance on the validation data.
  5. Performance of the selected final model (and other candidate models for post-model selection comparison) was evaluated using the test data.
  6. The train data (final) subset comprises:
    • 319 rows (observations)
      • 200 diagnosis=B: 62.70%
      • 119 diagnosis=M: 37.30%
    • 31 columns (variables)
  7. The validation data subset comprises:
    • 107 rows (observations)
      • 67 diagnosis=B: 62.62%
      • 40 diagnosis=M: 37.38%
    • 31 columns (variables)
  8. The test data subset comprises:
    • 143 rows (observations)
      • 90 diagnosis=B: 62.94%
      • 53 diagnosis=M: 37.06%
    • 31 columns (variables)
In [199]:
##################################
# Creating a dataset copy
# of the original data
##################################
breast_cancer_baseline = breast_cancer.copy()
In [200]:
##################################
# Performing a general exploration
# of the baseline dataset
##################################
print('Final Dataset Dimensions: ')
display(breast_cancer_baseline.shape)
Final Dataset Dimensions: 
(569, 31)
In [201]:
##################################
# Obtaining the distribution of
# of the target variable
##################################
print('Target Variable Breakdown: ')
breast_cancer_breakdown = breast_cancer_baseline.groupby('diagnosis', observed=True).size().reset_index(name='Count')
breast_cancer_breakdown['Percentage'] = (breast_cancer_breakdown['Count'] / len(breast_cancer_baseline)) * 100
display(breast_cancer_breakdown)
Target Variable Breakdown: 
diagnosis Count Percentage
0 B 357 62.741652
1 M 212 37.258348
In [202]:
##################################
# Formulating the train and test data
# from the final dataset
# by applying stratification and
# using a 75-25 ratio
##################################
breast_cancer_train_initial, breast_cancer_test = train_test_split(breast_cancer_baseline, 
                                                               test_size=0.25, 
                                                               stratify=breast_cancer_baseline['diagnosis'], 
                                                               random_state=987654321)
In [203]:
##################################
# Performing a general exploration
# of the initial training dataset
##################################
X_train_initial = breast_cancer_train_initial.drop('diagnosis', axis = 1)
y_train_initial = breast_cancer_train_initial['diagnosis']
print('Initial Train Dataset Dimensions: ')
display(X_train_initial.shape)
display(y_train_initial.shape)
print('Initial Train Target Variable Breakdown: ')
display(y_train_initial.value_counts())
print('Initial Train Target Variable Proportion: ')
display(y_train_initial.value_counts(normalize = True))
Initial Train Dataset Dimensions: 
(426, 30)
(426,)
Initial Train Target Variable Breakdown: 
diagnosis
B    267
M    159
Name: count, dtype: int64
Initial Train Target Variable Proportion: 
diagnosis
B    0.626761
M    0.373239
Name: proportion, dtype: float64
In [204]:
##################################
# Performing a general exploration
# of the test dataset
##################################
X_test = breast_cancer_test.drop('diagnosis', axis = 1)
y_test = breast_cancer_test['diagnosis']
print('Test Dataset Dimensions: ')
display(X_test.shape)
display(y_test.shape)
print('Test Target Variable Breakdown: ')
display(y_test.value_counts())
print('Test Target Variable Proportion: ')
display(y_test.value_counts(normalize = True))
Test Dataset Dimensions: 
(143, 30)
(143,)
Test Target Variable Breakdown: 
diagnosis
B    90
M    53
Name: count, dtype: int64
Test Target Variable Proportion: 
diagnosis
B    0.629371
M    0.370629
Name: proportion, dtype: float64
In [205]:
##################################
# Formulating the train and validation data
# from the train dataset
# by applying stratification and
# using a 75-25 ratio
##################################
breast_cancer_train, breast_cancer_validation = train_test_split(breast_cancer_train_initial, 
                                                             test_size=0.25, 
                                                             stratify=breast_cancer_train_initial['diagnosis'], 
                                                             random_state=987654321)
In [206]:
##################################
# Performing a general exploration
# of the final training dataset
##################################
X_train = breast_cancer_train.drop('diagnosis', axis = 1)
y_train = breast_cancer_train['diagnosis']
print('Final Train Dataset Dimensions: ')
display(X_train.shape)
display(y_train.shape)
print('Final Train Target Variable Breakdown: ')
display(y_train.value_counts())
print('Final Train Target Variable Proportion: ')
display(y_train.value_counts(normalize = True))
Final Train Dataset Dimensions: 
(319, 30)
(319,)
Final Train Target Variable Breakdown: 
diagnosis
B    200
M    119
Name: count, dtype: int64
Final Train Target Variable Proportion: 
diagnosis
B    0.626959
M    0.373041
Name: proportion, dtype: float64
In [207]:
##################################
# Performing a general exploration
# of the validation dataset
##################################
X_validation = breast_cancer_validation.drop('diagnosis', axis = 1)
y_validation = breast_cancer_validation['diagnosis']
print('Validation Dataset Dimensions: ')
display(X_validation.shape)
display(y_validation.shape)
print('Validation Target Variable Breakdown: ')
display(y_validation.value_counts())
print('Validation Target Variable Proportion: ')
display(y_validation.value_counts(normalize = True))
Validation Dataset Dimensions: 
(107, 30)
(107,)
Validation Target Variable Breakdown: 
diagnosis
B    67
M    40
Name: count, dtype: int64
Validation Target Variable Proportion: 
diagnosis
B    0.626168
M    0.373832
Name: proportion, dtype: float64
In [208]:
##################################
# Saving the training data
# to the DATASETS_FINAL_TRAIN_PATH
# and DATASETS_FINAL_TRAIN_FEATURES_PATH
# and DATASETS_FINAL_TRAIN_TARGET_PATH
##################################
breast_cancer_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_PATH, "breast_cancer_train.csv"), index=False)
X_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_FEATURES_PATH, "X_train.csv"), index=False)
y_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_TARGET_PATH, "y_train.csv"), index=False)
In [209]:
##################################
# Saving the validation data
# to the DATASETS_FINAL_VALIDATION_PATH
# and DATASETS_FINAL_VALIDATION_FEATURES_PATH
# and DATASETS_FINAL_VALIDATION_TARGET_PATH
##################################
breast_cancer_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_PATH, "breast_cancer_validation.csv"), index=False)
X_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_FEATURES_PATH, "X_validation.csv"), index=False)
y_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_TARGET_PATH, "y_validation.csv"), index=False)
In [210]:
##################################
# Saving the test data
# to the DATASETS_FINAL_TEST_PATH
# and DATASETS_FINAL_TEST_FEATURES_PATH
# and DATASETS_FINAL_TEST_TARGET_PATH
##################################
breast_cancer_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_PATH, "breast_cancer_test.csv"), index=False)
X_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_FEATURES_PATH, "X_test.csv"), index=False)
y_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_TARGET_PATH, "y_test.csv"), index=False)
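
As a quick sanity check, the persisted artifacts can be read back and compared against the in-memory splits. The sketch below assumes the same path constants defined earlier.

##################################
# Reloading the persisted training
# artifacts and verifying that their
# shapes match the in-memory splits
# (illustrative sanity check)
##################################
import pandas as pd

X_train_reloaded = pd.read_csv(os.path.join("..", DATASETS_FINAL_TRAIN_FEATURES_PATH, "X_train.csv"))
y_train_reloaded = pd.read_csv(os.path.join("..", DATASETS_FINAL_TRAIN_TARGET_PATH, "y_train.csv"))
assert X_train_reloaded.shape == X_train.shape
assert len(y_train_reloaded) == len(y_train)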

1.4.2 Outlier and Distributional Shape Analysis¶
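
A minimal sketch of one way to profile distributional shape, assuming the X_train split from above: pair per-feature skewness with IQR-based outlier counts.

##################################
# Computing per-feature skewness and
# IQR-based outlier counts on the
# training features
# (illustrative sketch)
##################################
import pandas as pd

skewness = X_train.skew()
q1, q3 = X_train.quantile(0.25), X_train.quantile(0.75)
iqr = q3 - q1
# Count values beyond the 1.5*IQR fences in each feature
outlier_counts = ((X_train < (q1 - 1.5 * iqr)) | (X_train > (q3 + 1.5 * iqr))).sum()
display(pd.DataFrame({'Skewness': skewness, 'Outlier.Count': outlier_counts}))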

1.4.3 Collinearity¶
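
A quick screen for collinearity, again assuming the X_train frame, scans the upper triangle of the absolute Pearson correlation matrix for pairs above a cutoff (0.90 here is illustrative, not the document's chosen threshold).

##################################
# Listing highly correlated feature
# pairs from the upper triangle of
# the absolute correlation matrix
# (illustrative sketch)
##################################
import numpy as np

corr_matrix = X_train.corr().abs()
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
high_corr_pairs = upper_triangle.stack().loc[lambda s: s > 0.90].sort_values(ascending=False)
display(high_corr_pairs)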

1.5. Data Exploration ¶

1.5.1 Exploratory Data Analysis¶

1.5.2 Hypothesis Testing¶
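
One simple formulation, assuming the training split above, is a Welch two-sample t-test per feature comparing benign against malignant cases; the sketch below is illustrative rather than the document's exact test battery.

##################################
# Applying Welch's t-test to compare
# each feature between benign and
# malignant cases
# (illustrative sketch)
##################################
import pandas as pd
from scipy import stats

benign = X_train[y_train == 'B']
malignant = X_train[y_train == 'M']
# Welch's variant (equal_var=False) avoids assuming equal group variances
p_values = {column: stats.ttest_ind(benign[column], malignant[column], equal_var=False).pvalue
            for column in X_train.columns}
display(pd.Series(p_values, name='P.Value').sort_values())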

1.6. Premodelling Data Preparation ¶

1.6.1 Preprocessed Data Description¶

1.6.2 Preprocessing Pipeline Development¶
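
A minimal pipeline sketch, fitting the scaler on the training split only so that validation and test data never leak into the preprocessing statistics:

##################################
# Composing a minimal preprocessing
# pipeline fitted on the training
# split only
# (illustrative sketch)
##################################
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

preprocessing_pipeline = Pipeline([('scaler', StandardScaler())])
X_train_preprocessed = preprocessing_pipeline.fit_transform(X_train)
X_validation_preprocessed = preprocessing_pipeline.transform(X_validation)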

1.7. Model Development and Validation ¶

1.7.1 Random Forest¶
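
A baseline fit might look like the sketch below, with the hyperparameters shown purely for illustration rather than the tuned values used in the project.

##################################
# Fitting a baseline random forest
# and scoring AUROC on the
# validation split
# (illustrative sketch)
##################################
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

random_forest = RandomForestClassifier(n_estimators=100, random_state=987654321)
random_forest.fit(X_train, y_train)
# Column 1 of predict_proba corresponds to 'M' (classes_ sorts alphabetically)
validation_probabilities = random_forest.predict_proba(X_validation)[:, 1]
print('Validation AUROC:', roc_auc_score((y_validation == 'M').astype(int), validation_probabilities))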

1.7.2 AdaBoost¶

1.7.3 Gradient Boosting¶

1.7.4 XGBoost¶

1.7.5 Light GBM¶

1.7.6 CatBoost¶

1.8. Model Monitoring using the NannyML Framework ¶

1.8.1 Baseline Control¶
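
Wiring up NannyML's confidence-based performance estimation (CBPE) for a baseline might look like the sketch below, where reference_df (a labeled period, e.g. the test split with model predictions attached) and analysis_df (an unlabeled post-deployment period) are hypothetical frames, not the document's actual variable names.

##################################
# Estimating AUROC without labels
# via NannyML's CBPE estimator
# (illustrative sketch; reference_df
# and analysis_df are hypothetical
# frames carrying model predictions)
##################################
import nannyml as nml

estimator = nml.CBPE(
    y_pred_proba='y_pred_proba',           # predicted probability column
    y_pred='y_pred',                       # predicted label column
    y_true='diagnosis',                    # ground truth (reference period only)
    problem_type='classification_binary',
    metrics=['roc_auc'],
    chunk_size=50
)
estimator.fit(reference_df)
estimated_performance = estimator.estimate(analysis_df)
estimated_performance.plot().show()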

1.8.2 Simulated Covariate Drift¶
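
Covariate drift can be induced by perturbing feature distributions while leaving labels untouched, then confirmed with NannyML's univariate drift calculator. In the sketch below, feature_columns, reference_df, and analysis_df are hypothetical stand-ins.

##################################
# Injecting a location shift into
# the features and detecting it with
# a Kolmogorov-Smirnov univariate
# drift calculator
# (illustrative sketch)
##################################
import nannyml as nml

# feature_columns, reference_df, analysis_df are hypothetical stand-ins
covariate_drifted_df = analysis_df.copy()
for column in feature_columns:
    # Shift each feature by half its standard deviation
    covariate_drifted_df[column] += 0.5 * covariate_drifted_df[column].std()

calculator = nml.UnivariateDriftCalculator(
    column_names=feature_columns,
    continuous_methods=['kolmogorov_smirnov'],
    chunk_size=50
)
calculator.fit(reference_df)
drift_results = calculator.calculate(covariate_drifted_df)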

1.8.3 Simulated Prior Shift¶
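
Prior shift changes only the class proportions. One way to simulate it, assuming the breast_cancer_test split from above, is to resample so the malignant share roughly inverts the original 63/37 split:

##################################
# Simulating prior shift by
# upsampling malignant cases and
# downsampling benign cases
# (illustrative sketch)
##################################
import pandas as pd

benign_cases = breast_cancer_test[breast_cancer_test['diagnosis'] == 'B']
malignant_cases = breast_cancer_test[breast_cancer_test['diagnosis'] == 'M']
prior_shifted_df = pd.concat([
    benign_cases.sample(n=50, random_state=987654321),
    malignant_cases.sample(n=85, replace=True, random_state=987654321)
]).sample(frac=1, random_state=987654321)  # shuffle the combined rows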

1.8.4 Simulated Concept Drift¶
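
Concept drift alters P(y|X) while feature distributions stay put. A crude simulation, treating the standard Wisconsin column name radius_mean as an assumed feature, flips diagnoses in the upper quartile of that driving feature:

##################################
# Simulating concept drift by
# flipping labels in the upper
# quartile of one feature, changing
# the feature-outcome relationship
# (illustrative sketch; radius_mean
# is an assumed column name)
##################################
concept_drifted_df = breast_cancer_test.copy()
threshold = concept_drifted_df['radius_mean'].quantile(0.75)
flip_mask = concept_drifted_df['radius_mean'] > threshold
concept_drifted_df.loc[flip_mask, 'diagnosis'] = (
    concept_drifted_df.loc[flip_mask, 'diagnosis'].map({'B': 'M', 'M': 'B'})
)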

1.8.5 Simulated Missingness Spike¶
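
A missingness spike can be simulated by masking a block of values in one feature and tracked with NannyML's data-quality tooling; the sketch assumes a recent nannyml release that ships MissingValuesCalculator, and again uses radius_mean as an assumed column name.

##################################
# Masking 30% of one feature and
# tracking the spike with NannyML's
# MissingValuesCalculator
# (illustrative sketch)
##################################
import numpy as np
import nannyml as nml

rng = np.random.default_rng(987654321)
missing_df = breast_cancer_test.copy()
spike_rows = rng.choice(missing_df.index, size=int(0.30 * len(missing_df)), replace=False)
missing_df.loc[spike_rows, 'radius_mean'] = np.nan

calculator = nml.MissingValuesCalculator(column_names=['radius_mean'], chunk_size=50)
calculator.fit(breast_cancer_test)     # clean reference period
missingness_results = calculator.calculate(missing_df)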

1.8.6 Simulated Seasonal Pattern¶
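
Seasonality can be approximated by modulating a feature with a sinusoid over a synthetic timestamp, once more treating radius_mean as an assumed column name:

##################################
# Modulating one feature with a
# roughly monthly sinusoidal cycle
# over synthetic daily timestamps
# (illustrative sketch)
##################################
import numpy as np
import pandas as pd

seasonal_df = breast_cancer_test.copy().reset_index(drop=True)
seasonal_df['timestamp'] = pd.date_range('2025-01-01', periods=len(seasonal_df), freq='D')
cycle = np.sin(2 * np.pi * np.arange(len(seasonal_df)) / 30)
seasonal_df['radius_mean'] *= (1 + 0.2 * cycle)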

1.9. Consolidated Findings ¶

2. Summary ¶

3. References ¶

  • [Book] Reliable Machine Learning by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley and Todd Underwood
  • [Book] Designing Machine Learning Systems by Chip Huyen
  • [Book] Machine Learning Design Patterns by Valliappa Lakshmanan, Sara Robinson and Michael Munn
  • [Book] Machine Learning Engineering by Andriy Burkov
  • [Book] Engineering MLOps by Emmanuel Raj
  • [Book] Introducing MLOps by Mark Treveil, Nicolas Omont, Clément Stenac, Kenji Lefevre, Du Phan, Joachim Zentici, Adrien Lavoillotte, Makoto Miyazaki and Lynn Heidmann
  • [Book] Practical MLOps by Noah Gift and Alfredo Deza
  • [Book] Data Science on AWS by Chris Fregly and Antje Barth
  • [Book] Ensemble Methods for Machine Learning by Gautam Kunapuli
  • [Book] Applied Predictive Modeling by Max Kuhn and Kjell Johnson
  • [Book] An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani
  • [Book] Ensemble Methods: Foundations and Algorithms by Zhi-Hua Zhou
  • [Book] Effective XGBoost: Optimizing, Tuning, Understanding, and Deploying Classification Models (Treading on Python) by Matt Harrison, Edward Krueger, Alex Rook, Ronald Legere and Bojan Tunguz
  • [Python Library API] nannyml by NannyML Team
  • [Python Library API] NumPy by NumPy Team
  • [Python Library API] pandas by Pandas Team
  • [Python Library API] seaborn by Seaborn Team
  • [Python Library API] matplotlib.pyplot by MatPlotLib Team
  • [Python Library API] itertools by Python Team
  • [Python Library API] sklearn.experimental by Scikit-Learn Team
  • [Python Library API] sklearn.preprocessing by Scikit-Learn Team
  • [Python Library API] scipy by SciPy Team
  • [Python Library API] sklearn.tree by Scikit-Learn Team
  • [Python Library API] sklearn.ensemble by Scikit-Learn Team
  • [Python Library API] sklearn.metrics by Scikit-Learn Team
  • [Python Library API] xgboost by XGBoost Team
  • [Python Library API] lightgbm by LightGBM Team
  • [Python Library API] catboost by CatBoost Team
  • [Python Library API] StatsModels by StatsModels Team
  • [Python Library API] SciPy by SciPy Team
  • [Article] Comprehensive Comparison of ML Model Monitoring Tools: Evidently AI, Alibi Detect, NannyML, WhyLabs, and Fiddler AI by Tanish Kandivlikar (Medium)
  • [Article] Monitoring AI in Production: Introduction to NannyML by Adnan Karol (Medium)
  • [Article] Data Drift Explainability: Interpretable Shift Detection with NannyML by Marco Cerliani (Towards Data Science)
  • [Article] An End-to-End ML Model Monitoring Workflow with NannyML in Python by Bex Tuychiyev (DataCamp)
  • [Article] Detecting Concept Drift: Impact on Machine Learning Performance by Michał Oleszak (NannyML.Com)
  • [Article] Estimating Model Performance Without Labels by Jakub Białek (NannyML.Com)
  • [Article] Monitoring Workflow for Machine Learning Systems by Santiago Víquez (NannyML.Com)
  • [Article] Don’t Let Yourself Be Fooled by Data Drift by Santiago Víquez (NannyML.Com)
  • [Article] Understanding Data Drift: Impact on Machine Learning Model Performance by Jakub Białek (NannyML.Com)
  • [Article] NannyML’s Guide to Data Quality and Covariate Shift by Magdalena Kowalczuk (NannyML.Com)
  • [Article] From Reactive to Proactive: Shift your ML Monitoring Approach by Qiamo (Luca) Zheng (NannyML.Com)
  • [Article] How to Detect Under-Performing Segments in ML Models by Kavita Rana (NannyML.Com)
  • [Article] Building Custom Metrics for Predictive Maintenance by Kavita Rana (NannyML.Com)
  • [Article] 3 Custom Metrics for Your Forecasting Models by Kavita Rana (NannyML.Com)
  • [Article] There's Data Drift, But Does It Matter? by Santiago Víquez (NannyML.Com)
  • [Article] Monitoring Custom Metrics without Ground Truth by Kavita Rana (NannyML.Com)
  • [Article] Which Multivariate Drift Detection Method Is Right for You: Comparing DRE and DC by Miles Weberman (NannyML.Com)
  • [Article] Prevent Failure of Product Defect Detection Models: A Post-Deployment Guide by Kavita Rana (NannyML.Com)
  • [Article] Common Pitfalls in Monitoring Default Prediction Models and How to Fix Them by Miles Weberman (NannyML.Com)
  • [Article] Why Relying on Training Data for ML Monitoring Can Trick You by Kavita Rana (NannyML.Com)
  • [Article] Using Concept Drift as a Model Retraining Trigger by Taliya Weinstein (NannyML.Com)
  • [Article] Retraining is Not All You Need by Miles Weberman (NannyML.Com)
  • [Article] A Comprehensive Guide to Univariate Drift Detection Methods by Kavita Rana (NannyML.Com)
  • [Article] Stress-free Monitoring of Predictive Maintenance Models by Kavita Rana (NannyML.Com)
  • [Article] Effective ML Monitoring: A Hands-on Example by Miles Weberman (NannyML.Com)
  • [Article] Don’t Drift Away with Your Data: Monitoring Data Drift from Setup to Cloud by Taliya Weinstein (NannyML.Com)
  • [Article] Comparing Multivariate Drift Detection Algorithms on Real-World Data by Kavita Rana (NannyML.Com)
  • [Article] Detect Data Drift Using Domain Classifier in Python by Miles Weberman (NannyML.Com)
  • [Article] Guide: How to evaluate if NannyML is the right monitoring tool for you by Santiago Víquez (NannyML.Com)
  • [Article] How To Monitor ML models with NannyML SageMaker Algorithms by Wiljan Cools (NannyML.Com)
  • [Article] Tutorial: Monitoring Missing and Unseen values with NannyML by Santiago Víquez (NannyML.Com)
  • [Article] Monitoring Machine Learning Models: A Fundamental Practice for Data Scientists and Machine Learning Engineers by Saurav Pawar (Medium)
  • [Article] Failure Is Not an Option: How to Prevent Your ML Model From Degradation by Maciej Balawejder (Medium)
  • [Article] Managing Data Drift and Data Distribution Shifts in the MLOps Lifecycle for Machine Learning Models by Abhishek Reddy (Medium)
  • [Article] “You Can’t Predict the Errors of Your Model”… Or Can You? by Samuele Mazzanti (Medium)
  • [Article] Understanding Concept Drift: A Simple Guide by Vitor Cerqueira (Medium)
  • [Article] Detecting Covariate Shift: A Guide to the Multivariate Approach by Michał Oleszak (Medium)
  • [Article] Data Drift vs. Concept Drift: Differences and How to Detect and Address Them by DataHeroes Team (DataHeroes.AI)
  • [Article] An Introduction to Machine Learning Engineering for Production/MLOps — Concept and Data Drifts by Praatibh Surana (Medium)
  • [Article] Concept Drift and Model Decay in Machine Learning by Ashok Chilakapati (Medium)
  • [Article] Data Drift: Types of Data Drift by Numal Jayawardena (Medium)
  • [Article] Monitoring Machine Learning models by Jacques Verre (Medium)
  • [Article] Data drift: It Can Come At You From Anywhere by Tirthajyoti Sarkar (Medium)
  • [Article] Drift in Machine Learning by Piotr (Peter) Mardziel (Medium)
  • [Article] Understanding Dataset Shift by Matthew Stewart (Medium)
  • [Article] Calculating Data Drift in Machine Learning using Python by Vatsal (Medium)
  • [Article] 91% of ML Models Degrade in Time by Santiago Víquez (Medium)
  • [Article] Model Drift in Machine Learning by Kurtis Pykes (Medium)
  • [Article] Production Machine Learning Monitoring: Outliers, Drift, Explainers & Statistical Performance by Alejandro Saucedo (Medium)
  • [Article] How to Detect Model Drift in MLOps Monitoring by Amit Paka (Medium)
  • [Article] “My data drifted. What’s next?” How to handle ML model drift in production. by Elena Samuylova (Medium)
  • [Article] Machine Learning Model Drift by Sophia Yang (Medium)
  • [Article] Estimating the Performance of an ML Model in the Absence of Ground Truth by Eryk Lewinson (Medium)
  • [Article] Ensemble: Boosting, Bagging, and Stacking Machine Learning by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] Stacking Machine Learning: Everything You Need to Know by Ada Parker (MachineLearningPro.Org)
  • [Article] Ensemble Learning: Bagging, Boosting and Stacking by Edouard Duchesnay, Tommy Lofstedt and Feki Younes (Duchesnay.GitHub.IO)
  • [Article] Stack Machine Learning Models: Get Better Results by Casper Hansen (Developer.IBM.Com)
  • [Article] GradientBoosting vs AdaBoost vs XGBoost vs CatBoost vs LightGBM by Geeks for Geeks Team (GeeksForGeeks.Org)
  • [Article] A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] The Ultimate Guide to AdaBoost Algorithm | What is AdaBoost Algorithm? by Ashish Kumar (MyGreatLearning.Com)
  • [Article] A Gentle Introduction to Ensemble Learning Algorithms by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results by Necati Demir (Toptal.Com)
  • [Article] The Essential Guide to Ensemble Learning by Rohit Kundu (V7Labs.Com)
  • [Article] Develop an Intuition for How Ensemble Learning Works by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] Mastering Ensemble Techniques in Machine Learning: Bagging, Boosting, Bayes Optimal Classifier, and Stacking by Rahul Jain (Medium)
  • [Article] Ensemble Learning: Bagging, Boosting, Stacking by Ayşe Kübra Kuyucu (Medium)
  • [Article] Ensemble: Boosting, Bagging, and Stacking Machine Learning by Aleyna Şenozan (Medium)
  • [Article] Boosting, Stacking, and Bagging for Ensemble Models for Time Series Analysis with Python by Kyle Jones (Medium)
  • [Article] Different types of Ensemble Techniques — Bagging, Boosting, Stacking, Voting, Blending by Abhishek Jain (Medium)
  • [Article] Understanding Ensemble Methods: Bagging, Boosting, and Stacking by Divya Bhagat (Medium)
  • [Video Tutorial] Concept Drift Detection with NannyML | Webinar by NannyML (YouTube)
  • [Video Tutorial] Fooled by Data Drift: How to Monitor ML Without False Positives by NannyML (YouTube)
  • [Video Tutorial] Monitoring Custom Metrics Without Access to Targets by NannyML (YouTube)
  • [Video Tutorial] Analyzing Your Model's Performance in Production by NannyML (YouTube)
  • [Video Tutorial] How to Monitor Predictive Maintenance Models | Webinar Replay by NannyML (YouTube)
  • [Video Tutorial] Machine Learning Monitoring Workflow [Webinar] by NannyML (YouTube)
  • [Video Tutorial] Monitoring Machine Learning Models on AWS | Webinar by NannyML (YouTube)
  • [Video Tutorial] Root Cause Analysis for ML Model Failure by NannyML (YouTube)
  • [Video Tutorial] Quantifying the Impact of Data Drift on Machine Learning Model Performance | Webinar by NannyML (YouTube)
  • [Video Tutorial] How to Detect Drift and Resolve Issues in Your Machine Learning Models? by NannyML (YouTube)
  • [Video Tutorial] Notebooks to Containers: Setting up Continuous (ML) Model Monitoring in Production by NannyML (YouTube)
  • [Video Tutorial] Performance Estimation using NannyML | Tutorial in Jupyter Notebook by NannyML (YouTube)
  • [Video Tutorial] What Is NannyML? Introducing Our Open Source Python Library by NannyML (YouTube)
  • [Video Tutorial] How to Automatically Retrain Your Models with Concept Drift Detection? by NannyML (YouTube)
  • [Video Tutorial] How to Use NannyML? Two Modes of Running Our Library by NannyML (YouTube)
  • [Video Tutorial] How to Integrate NannyML in Production? | Tutorial by NannyML (YouTube)
  • [Video Tutorial] Bringing Your Machine Learning Model to Production | Overview by NannyML (YouTube)
  • [Video Tutorial] ML Performance without Labels: Comparing Performance Estimation Methods (Webinar Replay) by NannyML (YouTube)
  • [Course] DataCamp Python Data Analyst Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Python Associate Data Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Python Data Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Machine Learning Engineer Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Machine Learning Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] IBM Data Analyst Professional Certificate by IBM Team (Coursera)
  • [Course] IBM Data Science Professional Certificate by IBM Team (Coursera)
  • [Course] IBM Machine Learning Professional Certificate by IBM Team (Coursera)