Supervised Learning: Comparing Optimization Algorithms in Parameter Updates and Loss Function Minimization for Neural Network Classification¶
- 1. Table of Contents
- 1.1 Data Background
- 1.2 Data Description
- 1.3 Data Quality Assessment
- 1.4 Data Preprocessing
- 1.5 Data Exploration
- 1.6 Neural Network Classification Gradient and Weight Updates
- 1.6.1 Premodelling Data Description
- 1.6.2 Stochastic Gradient Descent Optimization
- 1.6.3 Adaptive Moment Estimation Optimization
- 1.6.4 Adaptive Gradient Algorithm Optimization
- 1.6.5 AdaDelta Optimization
- 1.6.6 Layer-wise Optimized Non-convex Optimization
- 1.6.7 Root Mean Square Propagation Optimization
- 1.7 Consolidated Findings
- 2. Summary
- 3. References
1. Table of Contents ¶
This project manually implements the Stochastic Gradient Descent Optimization, Adaptive Moment Estimation Optimization, Adaptive Gradient Optimization, AdaDelta Optimization, Layer-wise Optimized Non-convex Optimization and Root Mean Square Propagation Optimization algorithms using various helpful packages in Python with fixed values applied for the learning rate and iteration count parameters to optimally update the gradients and weights of an artificial neural network classification model. The cost function and classification accuracy optimization profiles of the different optimization algorithms were compared. All results were consolidated in a Summary presented at the end of the document.
Artificial Neural Network, in the context of categorical response prediction, consists of interconnected nodes called neurons organized in layers. The model architecture involves an input layer which receives the input data, with each neuron representing a feature or attribute of the data; hidden layers which perform computations on the input data through weighted connections between neurons and apply activation functions to produce outputs; and the output layer which produces the final predictions equal to the number of classes, each representing the probability of the input belonging to a particular class, based on the computations performed in the hidden layers. Neurons within adjacent layers are connected by weighted connections. Each connection has an associated weight that determines the strength of influence one neuron has on another. These weights are adjusted during the training process to enable the network to learn from the input data and make accurate predictions. Activation functions introduce non-linearities into the network, allowing it to learn complex relationships between inputs and outputs. The training process involves presenting input data along with corresponding target outputs to the network and adjusting the weights to minimize the difference between the predicted outputs and the actual targets which is typically performed through optimization algorithms such as gradient descent and backpropagation. The training process iteratively updates the weights until the model's predictions closely match the target outputs.
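As a rough illustration of the forward pass described above, the sketch below builds a single-hidden-layer classifier with NumPy. The layer sizes, the sigmoid and softmax activation choices, and the random weight initialization are assumptions made for this sketch only and do not reflect the architecture implemented in the later modelling sections.
##################################
# Illustrative sketch (assumed architecture):
# forward pass of a single-hidden-layer
# neural network classifier
##################################
import numpy as np

def sigmoid(z):
    # Logistic activation introducing non-linearity
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Converts output-layer scores into class probabilities
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward_pass(X, W1, b1, W2, b2):
    # Hidden layer: weighted sum of inputs followed by activation
    hidden = sigmoid(X @ W1 + b1)
    # Output layer: one probability per class
    return softmax(hidden @ W2 + b2)

# Illustrative dimensions: 4 input features, 8 hidden neurons, 2 classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)) * 0.1, np.zeros(2)
print(forward_pass(X_demo, W1, b1, W2, b2))  # each row sums to 1.0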
Backpropagation and Weight Update, in the context of an artificial neural network, involve the process of iteratively adjusting the weights of the connections between neurons in the network to minimize the difference between the predicted and the actual target responses. Input data is fed into the neural network, and it propagates through the network layer by layer, starting from the input layer, through hidden layers, and ending at the output layer. At each neuron, the weighted sum of inputs is calculated, followed by the application of an activation function to produce the neuron's output. Once the forward pass is complete, the network's output is compared to the actual target output. The difference between the predicted output and the actual output is quantified using a loss function, which measures the discrepancy between the predicted and actual values. Common loss functions for classification tasks include cross-entropy loss. During the backward pass, the error is propagated backward through the network to compute the gradients of the loss function with respect to each weight in the network. This is achieved using the chain rule of calculus, which allows the error to be decomposed and distributed backward through the network. The gradients quantify how much a change in each weight would affect the overall error of the network. Once the gradients are computed, the weights are updated in the opposite direction of the gradient to minimize the error. This update is typically performed using an optimization algorithm such as gradient descent, which adjusts the weights in proportion to their gradients and a learning rate hyperparameter. The learning rate determines the size of the step taken in the direction opposite to the gradient. These steps are repeated for multiple iterations (epochs) over the training data. As the training progresses, the weights are adjusted iteratively to minimize the error, leading to a neural network model that accurately classifies input data.
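To make the backward pass concrete, the sketch below performs one training step for the same kind of single-hidden-layer classifier: a forward pass, a cross-entropy loss, gradients obtained via the chain rule, and a plain gradient-descent weight update. The learning rate value and the sigmoid hidden layer are assumptions for illustration only, not the settings applied in the modelling sections.
##################################
# Illustrative sketch (assumed settings):
# one backpropagation and gradient-descent
# weight update step
##################################
import numpy as np

def backprop_step(X, y_onehot, W1, b1, W2, b2, learning_rate=0.1):
    # Forward pass: sigmoid hidden layer followed by softmax output layer
    hidden = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
    scores = hidden @ W2 + b2
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    # Cross-entropy loss between predicted probabilities and actual targets
    loss = -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1))

    # Backward pass (chain rule): softmax with cross-entropy gives probs - targets
    d_scores = (probs - y_onehot) / X.shape[0]
    dW2, db2 = hidden.T @ d_scores, d_scores.sum(axis=0)
    d_hidden = (d_scores @ W2.T) * hidden * (1.0 - hidden)
    dW1, db1 = X.T @ d_hidden, d_hidden.sum(axis=0)

    # Weight update in the direction opposite to the gradients
    W1 -= learning_rate * dW1; b1 -= learning_rate * db1
    W2 -= learning_rate * dW2; b2 -= learning_rate * db2
    return loss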
Optimization Algorithms, in the context of neural network classification, are methods used to adjust the parameters (weights and biases) of a neural network during the training process in order to minimize a predefined loss function. The primary goal of these algorithms is to optimize the performance of the neural network by iteratively updating its parameters based on the feedback provided by the training data. Optimization algorithms play a critical role in the training of neural networks because they determine how effectively the network learns from the data and how quickly it converges to an optimal solution. These algorithms are significant during model development in improving model accuracy (optimization algorithms help improve the accuracy of neural network models by minimizing the classification error on the training data), enhancing generalization (by minimizing the loss function during training, optimization algorithms aim to generalize well to unseen data, thereby improving the model's ability to make accurate predictions on new inputs), reducing training time (efficient optimization algorithms can accelerate the convergence of the training process, leading to shorter training times for neural networks), handling complex data (since neural networks often deal with high-dimensional and non-linear data, optimization algorithms enable neural networks to effectively learn complex patterns and relationships within the data, leading to improved classification performance) and adapting to variations in data (optimization algorithms can adapt the model's parameters based on variations in the training data, ensuring robustness and stability in the face of different input distributions or data characteristics).
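For reference ahead of the individual sections, the sketch below writes out the per-iteration parameter-update rules of five of the optimizers compared in this project (Stochastic Gradient Descent, AdaGrad, RMSProp, AdaDelta, and Adam) as plain NumPy functions acting on a weight array and its gradient. The default hyperparameter values shown are common textbook choices and are assumptions for illustration; they are not the fixed learning rate and iteration settings used later.
##################################
# Illustrative sketch (assumed hyperparameters):
# parameter-update rules of the compared optimizers
##################################
import numpy as np

def sgd_update(w, grad, lr=0.01):
    # Stochastic Gradient Descent: step against the gradient
    return w - lr * grad

def adagrad_update(w, grad, cache, lr=0.01, eps=1e-8):
    # AdaGrad: accumulated squared gradients scale the step per parameter
    cache = cache + grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

def rmsprop_update(w, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    # RMSProp: exponentially decaying average of squared gradients
    cache = rho * cache + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

def adadelta_update(w, grad, sq_grad, sq_delta, rho=0.95, eps=1e-6):
    # AdaDelta: running average of past updates replaces the learning rate
    sq_grad = rho * sq_grad + (1 - rho) * grad ** 2
    delta = -(np.sqrt(sq_delta + eps) / np.sqrt(sq_grad + eps)) * grad
    sq_delta = rho * sq_delta + (1 - rho) * delta ** 2
    return w + delta, sq_grad, sq_delta

def adam_update(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: bias-corrected first and second moment estimates of the gradient
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v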
1.1. Data Background ¶
Datasets used for the analysis were separately gathered and consolidated from various sources including:
- Cancer Rates from World Population Review
- Social Protection and Labor Indicator from World Bank
- Education Indicator from World Bank
- Economy and Growth Indicator from World Bank
- Environment Indicator from World Bank
- Climate Change Indicator from World Bank
- Agricultural and Rural Development Indicator from World Bank
- Social Development Indicator from World Bank
- Health Indicator from World Bank
- Science and Technology Indicator from World Bank
- Urban Development Indicator from World Bank
- Human Development Indices from Human Development Reports
- Environmental Performance Indices from Yale Center for Environmental Law and Policy
This study hypothesized that various global development indicators and indices influence cancer rates across countries.
The target variable for the study is:
- CANRAT - Dichotomized category based on age-standardized cancer rates, per 100K population (2022)
The predictor variables for the study are:
- GDPPER - GDP per person employed, current US Dollars (2020)
- URBPOP - Urban population, % of total population (2020)
- PATRES - Patent applications by residents, total count (2020)
- RNDGDP - Research and development expenditure, % of GDP (2020)
- POPGRO - Population growth, annual % (2020)
- LIFEXP - Life expectancy at birth, total in years (2020)
- TUBINC - Incidence of tuberculosis, per 100K population (2020)
- DTHCMD - Cause of death by communicable diseases and maternal, prenatal and nutrition conditions, % of total (2019)
- AGRLND - Agricultural land, % of land area (2020)
- GHGEMI - Total greenhouse gas emissions, kt of CO2 equivalent (2020)
- RELOUT - Renewable electricity output, % of total electricity output (2015)
- METEMI - Methane emissions, kt of CO2 equivalent (2020)
- FORARE - Forest area, % of land area (2020)
- CO2EMI - CO2 emissions, metric tons per capita (2020)
- PM2EXP - PM2.5 air pollution, population exposed to levels exceeding WHO guideline value, % of total (2017)
- POPDEN - Population density, people per sq. km of land area (2020)
- GDPCAP - GDP per capita, current US Dollars (2020)
- ENRTER - Tertiary school enrollment, % gross (2020)
- HDICAT - Human development index, ordered category (2020)
- EPISCO - Environmental performance index, score (2022)
1.2. Data Description ¶
- The dataset is comprised of:
- 177 rows (observations)
- 22 columns (variables)
- 1/22 metadata (object)
- COUNTRY
- 1/22 target (categorical)
- CANRAT
- 19/22 predictor (numeric)
- GDPPER
- URBPOP
- PATRES
- RNDGDP
- POPGRO
- LIFEXP
- TUBINC
- DTHCMD
- AGRLND
- GHGEMI
- RELOUT
- METEMI
- FORARE
- CO2EMI
- PM2EXP
- POPDEN
- GDPCAP
- ENRTER
- EPISCO
- 1/22 predictor (categorical)
- HDICAT
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
import os
%matplotlib inline
from operator import add,mul,truediv
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import StandardScaler
from scipy import stats
##################################
# Defining file paths
##################################
DATASETS_ORIGINAL_PATH = r"datasets\original"
##################################
# Loading the dataset
# from the DATASETS_ORIGINAL_PATH
##################################
cancer_rate = pd.read_csv(os.path.join("..", DATASETS_ORIGINAL_PATH, "CategoricalCancerRates.csv"))
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(cancer_rate.shape)
Dataset Dimensions:
(177, 22)
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(cancer_rate.dtypes)
Column Names and Data Types:
COUNTRY     object
CANRAT      object
GDPPER     float64
URBPOP     float64
PATRES     float64
RNDGDP     float64
POPGRO     float64
LIFEXP     float64
TUBINC     float64
DTHCMD     float64
AGRLND     float64
GHGEMI     float64
RELOUT     float64
METEMI     float64
FORARE     float64
CO2EMI     float64
PM2EXP     float64
POPDEN     float64
ENRTER     float64
GDPCAP     float64
HDICAT      object
EPISCO     float64
dtype: object
##################################
# Taking a snapshot of the dataset
##################################
cancer_rate.head()
COUNTRY | CANRAT | GDPPER | URBPOP | PATRES | RNDGDP | POPGRO | LIFEXP | TUBINC | DTHCMD | ... | RELOUT | METEMI | FORARE | CO2EMI | PM2EXP | POPDEN | ENRTER | GDPCAP | HDICAT | EPISCO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Australia | High | 98380.63601 | 86.241 | 2368.0 | NaN | 1.235701 | 83.200000 | 7.2 | 4.941054 | ... | 13.637841 | 131484.763200 | 17.421315 | 14.772658 | 24.893584 | 3.335312 | 110.139221 | 51722.06900 | VH | 60.1 |
1 | New Zealand | High | 77541.76438 | 86.699 | 348.0 | NaN | 2.204789 | 82.256098 | 7.2 | 4.354730 | ... | 80.081439 | 32241.937000 | 37.570126 | 6.160799 | NaN | 19.331586 | 75.734833 | 41760.59478 | VH | 56.7 |
2 | Ireland | High | 198405.87500 | 63.653 | 75.0 | 1.23244 | 1.029111 | 82.556098 | 5.3 | 5.684596 | ... | 27.965408 | 15252.824630 | 11.351720 | 6.768228 | 0.274092 | 72.367281 | 74.680313 | 85420.19086 | VH | 57.4 |
3 | United States | High | 130941.63690 | 82.664 | 269586.0 | 3.42287 | 0.964348 | 76.980488 | 2.3 | 5.302060 | ... | 13.228593 | 748241.402900 | 33.866926 | 13.032828 | 3.343170 | 36.240985 | 87.567657 | 63528.63430 | VH | 51.1 |
4 | Denmark | High | 113300.60110 | 88.116 | 1261.0 | 2.96873 | 0.291641 | 81.602439 | 4.1 | 6.826140 | ... | 65.505925 | 7778.773921 | 15.711000 | 4.691237 | 56.914456 | 145.785100 | 82.664330 | 60915.42440 | VH | 77.9 |
5 rows × 22 columns
##################################
# Setting the levels of the categorical variables
##################################
cancer_rate['CANRAT'] = cancer_rate['CANRAT'].astype('category')
cancer_rate['CANRAT'] = cancer_rate['CANRAT'].cat.set_categories(['Low', 'High'], ordered=True)
cancer_rate['HDICAT'] = cancer_rate['HDICAT'].astype('category')
cancer_rate['HDICAT'] = cancer_rate['HDICAT'].cat.set_categories(['L', 'M', 'H', 'VH'], ordered=True)
##################################
# Performing a general exploration of the numeric variables
##################################
print('Numeric Variable Summary:')
display(cancer_rate.describe(include='number').transpose())
Numeric Variable Summary:
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
GDPPER | 165.0 | 45284.424283 | 3.941794e+04 | 1718.804896 | 13545.254510 | 34024.900890 | 66778.416050 | 2.346469e+05 |
URBPOP | 174.0 | 59.788121 | 2.280640e+01 | 13.345000 | 42.432750 | 61.701500 | 79.186500 | 1.000000e+02 |
PATRES | 108.0 | 20607.388889 | 1.340683e+05 | 1.000000 | 35.250000 | 244.500000 | 1297.750000 | 1.344817e+06 |
RNDGDP | 74.0 | 1.197474 | 1.189956e+00 | 0.039770 | 0.256372 | 0.873660 | 1.608842 | 5.354510e+00 |
POPGRO | 174.0 | 1.127028 | 1.197718e+00 | -2.079337 | 0.236900 | 1.179959 | 2.031154 | 3.727101e+00 |
LIFEXP | 174.0 | 71.746113 | 7.606209e+00 | 52.777000 | 65.907500 | 72.464610 | 77.523500 | 8.456000e+01 |
TUBINC | 174.0 | 105.005862 | 1.367229e+02 | 0.770000 | 12.000000 | 44.500000 | 147.750000 | 5.920000e+02 |
DTHCMD | 170.0 | 21.260521 | 1.927333e+01 | 1.283611 | 6.078009 | 12.456279 | 36.980457 | 6.520789e+01 |
AGRLND | 174.0 | 38.793456 | 2.171551e+01 | 0.512821 | 20.130276 | 40.386649 | 54.013754 | 8.084112e+01 |
GHGEMI | 170.0 | 259582.709895 | 1.118550e+06 | 179.725150 | 12527.487367 | 41009.275980 | 116482.578575 | 1.294287e+07 |
RELOUT | 153.0 | 39.760036 | 3.191492e+01 | 0.000296 | 10.582691 | 32.381668 | 63.011450 | 1.000000e+02 |
METEMI | 170.0 | 47876.133575 | 1.346611e+05 | 11.596147 | 3662.884908 | 11118.976025 | 32368.909040 | 1.186285e+06 |
FORARE | 173.0 | 32.218177 | 2.312001e+01 | 0.008078 | 11.604388 | 31.509048 | 49.071780 | 9.741212e+01 |
CO2EMI | 170.0 | 3.751097 | 4.606479e+00 | 0.032585 | 0.631924 | 2.298368 | 4.823496 | 3.172684e+01 |
PM2EXP | 167.0 | 91.940595 | 2.206003e+01 | 0.274092 | 99.627134 | 100.000000 | 100.000000 | 1.000000e+02 |
POPDEN | 174.0 | 200.886765 | 6.453834e+02 | 2.115134 | 27.454539 | 77.983133 | 153.993650 | 7.918951e+03 |
ENRTER | 116.0 | 49.994997 | 2.970619e+01 | 2.432581 | 22.107195 | 53.392460 | 71.057467 | 1.433107e+02 |
GDPCAP | 170.0 | 13992.095610 | 1.957954e+04 | 216.827417 | 1870.503029 | 5348.192875 | 17421.116227 | 1.173705e+05 |
EPISCO | 165.0 | 42.946667 | 1.249086e+01 | 18.900000 | 33.000000 | 40.900000 | 50.500000 | 7.790000e+01 |
##################################
# Performing a general exploration of the object variable
##################################
print('Object Variable Summary:')
display(cancer_rate.describe(include='object').transpose())
Object Variable Summary:
count | unique | top | freq | |
---|---|---|---|---|
COUNTRY | 177 | 177 | Australia | 1 |
##################################
# Performing a general exploration of the categorical variables
##################################
print('Categorical Variable Summary:')
display(cancer_rate.describe(include='category').transpose())
Categorical Variable Summary:
count | unique | top | freq | |
---|---|---|---|---|
CANRAT | 177 | 2 | Low | 132 |
HDICAT | 167 | 4 | VH | 59 |
1.3. Data Quality Assessment ¶
Data quality findings based on assessment are as follows:
- No duplicated rows observed.
- Missing data noted for 20 variables with Null.Count>0 and Fill.Rate<1.0.
- RNDGDP: Null.Count = 103, Fill.Rate = 0.418
- PATRES: Null.Count = 69, Fill.Rate = 0.610
- ENRTER: Null.Count = 61, Fill.Rate = 0.655
- RELOUT: Null.Count = 24, Fill.Rate = 0.864
- GDPPER: Null.Count = 12, Fill.Rate = 0.932
- EPISCO: Null.Count = 12, Fill.Rate = 0.932
- HDICAT: Null.Count = 10, Fill.Rate = 0.943
- PM2EXP: Null.Count = 10, Fill.Rate = 0.943
- DTHCMD: Null.Count = 7, Fill.Rate = 0.960
- METEMI: Null.Count = 7, Fill.Rate = 0.960
- CO2EMI: Null.Count = 7, Fill.Rate = 0.960
- GDPCAP: Null.Count = 7, Fill.Rate = 0.960
- GHGEMI: Null.Count = 7, Fill.Rate = 0.960
- FORARE: Null.Count = 4, Fill.Rate = 0.977
- TUBINC: Null.Count = 3, Fill.Rate = 0.983
- AGRLND: Null.Count = 3, Fill.Rate = 0.983
- POPGRO: Null.Count = 3, Fill.Rate = 0.983
- POPDEN: Null.Count = 3, Fill.Rate = 0.983
- URBPOP: Null.Count = 3, Fill.Rate = 0.983
- LIFEXP: Null.Count = 3, Fill.Rate = 0.983
- 120 observations noted with at least 1 missing value. Of these, 14 observations reported a high Missing.Rate>0.2.
- COUNTRY=Guadeloupe: Missing.Rate= 0.909
- COUNTRY=Martinique: Missing.Rate= 0.909
- COUNTRY=French Guiana: Missing.Rate= 0.909
- COUNTRY=New Caledonia: Missing.Rate= 0.500
- COUNTRY=French Polynesia: Missing.Rate= 0.500
- COUNTRY=Guam: Missing.Rate= 0.500
- COUNTRY=Puerto Rico: Missing.Rate= 0.409
- COUNTRY=North Korea: Missing.Rate= 0.273
- COUNTRY=Somalia: Missing.Rate= 0.273
- COUNTRY=South Sudan: Missing.Rate= 0.273
- COUNTRY=Venezuela: Missing.Rate= 0.227
- COUNTRY=Libya: Missing.Rate= 0.227
- COUNTRY=Eritrea: Missing.Rate= 0.227
- COUNTRY=Yemen: Missing.Rate= 0.227
- Low variance observed for 1 variable with First.Second.Mode.Ratio>5.
- PM2EXP: First.Second.Mode.Ratio = 53.000
- No low variance observed for any variable with Unique.Count.Ratio>10.
- High skewness observed for 5 variables with Skewness>3 or Skewness<(-3).
- POPDEN: Skewness = +10.267
- GHGEMI: Skewness = +9.496
- PATRES: Skewness = +9.284
- METEMI: Skewness = +5.801
- PM2EXP: Skewness = -3.141
##################################
# Counting the number of duplicated rows
##################################
cancer_rate.duplicated().sum()
np.int64(0)
##################################
# Gathering the data types for each column
##################################
data_type_list = list(cancer_rate.dtypes)
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(cancer_rate.columns)
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(cancer_rate)] * len(cancer_rate.columns))
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(cancer_rate.isna().sum(axis=0))
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(cancer_rate.count())
##################################
# Gathering the fill rate for each column
##################################
fill_rate_list = map(truediv, non_null_count_list, row_count_list)
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
data_type_list,
row_count_list,
non_null_count_list,
null_count_list,
fill_rate_list),
columns=['Column.Name',
'Column.Type',
'Row.Count',
'Non.Null.Count',
'Null.Count',
'Fill.Rate'])
display(all_column_quality_summary)
Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate | |
---|---|---|---|---|---|---|
0 | COUNTRY | object | 177 | 177 | 0 | 1.000000 |
1 | CANRAT | category | 177 | 177 | 0 | 1.000000 |
2 | GDPPER | float64 | 177 | 165 | 12 | 0.932203 |
3 | URBPOP | float64 | 177 | 174 | 3 | 0.983051 |
4 | PATRES | float64 | 177 | 108 | 69 | 0.610169 |
5 | RNDGDP | float64 | 177 | 74 | 103 | 0.418079 |
6 | POPGRO | float64 | 177 | 174 | 3 | 0.983051 |
7 | LIFEXP | float64 | 177 | 174 | 3 | 0.983051 |
8 | TUBINC | float64 | 177 | 174 | 3 | 0.983051 |
9 | DTHCMD | float64 | 177 | 170 | 7 | 0.960452 |
10 | AGRLND | float64 | 177 | 174 | 3 | 0.983051 |
11 | GHGEMI | float64 | 177 | 170 | 7 | 0.960452 |
12 | RELOUT | float64 | 177 | 153 | 24 | 0.864407 |
13 | METEMI | float64 | 177 | 170 | 7 | 0.960452 |
14 | FORARE | float64 | 177 | 173 | 4 | 0.977401 |
15 | CO2EMI | float64 | 177 | 170 | 7 | 0.960452 |
16 | PM2EXP | float64 | 177 | 167 | 10 | 0.943503 |
17 | POPDEN | float64 | 177 | 174 | 3 | 0.983051 |
18 | ENRTER | float64 | 177 | 116 | 61 | 0.655367 |
19 | GDPCAP | float64 | 177 | 170 | 7 | 0.960452 |
20 | HDICAT | category | 177 | 167 | 10 | 0.943503 |
21 | EPISCO | float64 | 177 | 165 | 12 | 0.932203 |
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])
20
##################################
# Identifying the columns
# with Fill.Rate < 1.00
##################################
display(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)].sort_values(by=['Fill.Rate'], ascending=True))
Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate | |
---|---|---|---|---|---|---|
5 | RNDGDP | float64 | 177 | 74 | 103 | 0.418079 |
4 | PATRES | float64 | 177 | 108 | 69 | 0.610169 |
18 | ENRTER | float64 | 177 | 116 | 61 | 0.655367 |
12 | RELOUT | float64 | 177 | 153 | 24 | 0.864407 |
21 | EPISCO | float64 | 177 | 165 | 12 | 0.932203 |
2 | GDPPER | float64 | 177 | 165 | 12 | 0.932203 |
16 | PM2EXP | float64 | 177 | 167 | 10 | 0.943503 |
20 | HDICAT | category | 177 | 167 | 10 | 0.943503 |
15 | CO2EMI | float64 | 177 | 170 | 7 | 0.960452 |
13 | METEMI | float64 | 177 | 170 | 7 | 0.960452 |
11 | GHGEMI | float64 | 177 | 170 | 7 | 0.960452 |
9 | DTHCMD | float64 | 177 | 170 | 7 | 0.960452 |
19 | GDPCAP | float64 | 177 | 170 | 7 | 0.960452 |
14 | FORARE | float64 | 177 | 173 | 4 | 0.977401 |
6 | POPGRO | float64 | 177 | 174 | 3 | 0.983051 |
3 | URBPOP | float64 | 177 | 174 | 3 | 0.983051 |
17 | POPDEN | float64 | 177 | 174 | 3 | 0.983051 |
10 | AGRLND | float64 | 177 | 174 | 3 | 0.983051 |
7 | LIFEXP | float64 | 177 | 174 | 3 | 0.983051 |
8 | TUBINC | float64 | 177 | 174 | 3 | 0.983051 |
##################################
# Identifying the columns
# with Fill.Rate < 0.90
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<0.90)]
##################################
# Gathering the metadata labels for each observation
##################################
row_metadata_list = cancer_rate["COUNTRY"].values.tolist()
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(cancer_rate.columns)] * len(cancer_rate))
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(cancer_rate.isna().sum(axis=1))
##################################
# Gathering the missing data percentage for each row
##################################
missing_rate_list = map(truediv, null_row_list, column_count_list)
##################################
# Identifying the rows
# with missing data
##################################
all_row_quality_summary = pd.DataFrame(zip(row_metadata_list,
column_count_list,
null_row_list,
missing_rate_list),
columns=['Row.Name',
'Column.Count',
'Null.Count',
'Missing.Rate'])
display(all_row_quality_summary)
Row.Name | Column.Count | Null.Count | Missing.Rate | |
---|---|---|---|---|
0 | Australia | 22 | 1 | 0.045455 |
1 | New Zealand | 22 | 2 | 0.090909 |
2 | Ireland | 22 | 0 | 0.000000 |
3 | United States | 22 | 0 | 0.000000 |
4 | Denmark | 22 | 0 | 0.000000 |
... | ... | ... | ... | ... |
172 | Congo Republic | 22 | 3 | 0.136364 |
173 | Bhutan | 22 | 2 | 0.090909 |
174 | Nepal | 22 | 2 | 0.090909 |
175 | Gambia | 22 | 4 | 0.181818 |
176 | Niger | 22 | 2 | 0.090909 |
177 rows × 4 columns
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])
120
##################################
# Counting the number of rows
# with Missing.Rate > 0.20
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.20)])
14
##################################
# Identifying the rows
# with Missing.Rate > 0.20
##################################
row_high_missing_rate = all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.20)]
##################################
# Identifying the rows
# with Missing.Rate > 0.20
##################################
display(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.20)].sort_values(by=['Missing.Rate'], ascending=False))
Row.Name | Column.Count | Null.Count | Missing.Rate | |
---|---|---|---|---|
35 | Guadeloupe | 22 | 20 | 0.909091 |
39 | Martinique | 22 | 20 | 0.909091 |
56 | French Guiana | 22 | 20 | 0.909091 |
13 | New Caledonia | 22 | 11 | 0.500000 |
44 | French Polynesia | 22 | 11 | 0.500000 |
75 | Guam | 22 | 11 | 0.500000 |
53 | Puerto Rico | 22 | 9 | 0.409091 |
85 | North Korea | 22 | 6 | 0.272727 |
168 | South Sudan | 22 | 6 | 0.272727 |
132 | Somalia | 22 | 6 | 0.272727 |
117 | Libya | 22 | 5 | 0.227273 |
73 | Venezuela | 22 | 5 | 0.227273 |
161 | Eritrea | 22 | 5 | 0.227273 |
164 | Yemen | 22 | 5 | 0.227273 |
##################################
# Formulating the dataset
# with numeric columns only
##################################
cancer_rate_numeric = cancer_rate.select_dtypes(include='number')
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = cancer_rate_numeric.columns
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = cancer_rate_numeric.min()
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = cancer_rate_numeric.mean()
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = cancer_rate_numeric.median()
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = cancer_rate_numeric.max()
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [cancer_rate[x].value_counts(dropna=True).index.tolist()[0] for x in cancer_rate_numeric]
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [cancer_rate[x].value_counts(dropna=True).index.tolist()[1] for x in cancer_rate_numeric]
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [cancer_rate_numeric[x].isin([cancer_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_rate_numeric]
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [cancer_rate_numeric[x].isin([cancer_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_rate_numeric]
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = cancer_rate_numeric.nunique(dropna=True)
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(cancer_rate_numeric)] * len(cancer_rate_numeric.columns))
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = cancer_rate_numeric.skew()
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = cancer_rate_numeric.kurtosis()
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
numeric_minimum_list,
numeric_mean_list,
numeric_median_list,
numeric_maximum_list,
numeric_first_mode_list,
numeric_second_mode_list,
numeric_first_mode_count_list,
numeric_second_mode_count_list,
numeric_first_second_mode_ratio_list,
numeric_unique_count_list,
numeric_row_count_list,
numeric_unique_count_ratio_list,
numeric_skewness_list,
numeric_kurtosis_list),
columns=['Numeric.Column.Name',
'Minimum',
'Mean',
'Median',
'Maximum',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio',
'Skewness',
'Kurtosis'])
display(numeric_column_quality_summary)
Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GDPPER | 1718.804896 | 45284.424283 | 34024.900890 | 2.346469e+05 | 98380.636010 | 77541.764380 | 1 | 1 | 1.000000 | 165 | 177 | 0.932203 | 1.517574 | 3.471992 |
1 | URBPOP | 13.345000 | 59.788121 | 61.701500 | 1.000000e+02 | 100.000000 | 86.699000 | 2 | 1 | 2.000000 | 173 | 177 | 0.977401 | -0.210702 | -0.962847 |
2 | PATRES | 1.000000 | 20607.388889 | 244.500000 | 1.344817e+06 | 6.000000 | 2.000000 | 4 | 3 | 1.333333 | 97 | 177 | 0.548023 | 9.284436 | 91.187178 |
3 | RNDGDP | 0.039770 | 1.197474 | 0.873660 | 5.354510e+00 | 1.232440 | 3.422870 | 1 | 1 | 1.000000 | 74 | 177 | 0.418079 | 1.396742 | 1.695957 |
4 | POPGRO | -2.079337 | 1.127028 | 1.179959 | 3.727101e+00 | 1.235701 | 2.204789 | 1 | 1 | 1.000000 | 174 | 177 | 0.983051 | -0.195161 | -0.423580 |
5 | LIFEXP | 52.777000 | 71.746113 | 72.464610 | 8.456000e+01 | 83.200000 | 82.256098 | 1 | 1 | 1.000000 | 174 | 177 | 0.983051 | -0.357965 | -0.649601 |
6 | TUBINC | 0.770000 | 105.005862 | 44.500000 | 5.920000e+02 | 12.000000 | 4.100000 | 4 | 3 | 1.333333 | 131 | 177 | 0.740113 | 1.746333 | 2.429368 |
7 | DTHCMD | 1.283611 | 21.260521 | 12.456279 | 6.520789e+01 | 4.941054 | 4.354730 | 1 | 1 | 1.000000 | 170 | 177 | 0.960452 | 0.900509 | -0.691541 |
8 | AGRLND | 0.512821 | 38.793456 | 40.386649 | 8.084112e+01 | 46.252480 | 38.562911 | 1 | 1 | 1.000000 | 174 | 177 | 0.983051 | 0.074000 | -0.926249 |
9 | GHGEMI | 179.725150 | 259582.709895 | 41009.275980 | 1.294287e+07 | 571903.119900 | 80158.025830 | 1 | 1 | 1.000000 | 170 | 177 | 0.960452 | 9.496120 | 101.637308 |
10 | RELOUT | 0.000296 | 39.760036 | 32.381668 | 1.000000e+02 | 100.000000 | 80.081439 | 3 | 1 | 3.000000 | 151 | 177 | 0.853107 | 0.501088 | -0.981774 |
11 | METEMI | 11.596147 | 47876.133575 | 11118.976025 | 1.186285e+06 | 131484.763200 | 32241.937000 | 1 | 1 | 1.000000 | 170 | 177 | 0.960452 | 5.801014 | 38.661386 |
12 | FORARE | 0.008078 | 32.218177 | 31.509048 | 9.741212e+01 | 17.421315 | 37.570126 | 1 | 1 | 1.000000 | 173 | 177 | 0.977401 | 0.519277 | -0.322589 |
13 | CO2EMI | 0.032585 | 3.751097 | 2.298368 | 3.172684e+01 | 14.772658 | 6.160799 | 1 | 1 | 1.000000 | 170 | 177 | 0.960452 | 2.721552 | 10.311574 |
14 | PM2EXP | 0.274092 | 91.940595 | 100.000000 | 1.000000e+02 | 100.000000 | 100.000000 | 106 | 2 | 53.000000 | 61 | 177 | 0.344633 | -3.141557 | 9.032386 |
15 | POPDEN | 2.115134 | 200.886765 | 77.983133 | 7.918951e+03 | 3.335312 | 19.331586 | 1 | 1 | 1.000000 | 174 | 177 | 0.983051 | 10.267750 | 119.995256 |
16 | ENRTER | 2.432581 | 49.994997 | 53.392460 | 1.433107e+02 | 110.139221 | 75.734833 | 1 | 1 | 1.000000 | 116 | 177 | 0.655367 | 0.275863 | -0.392895 |
17 | GDPCAP | 216.827417 | 13992.095610 | 5348.192875 | 1.173705e+05 | 51722.069000 | 41760.594780 | 1 | 1 | 1.000000 | 170 | 177 | 0.960452 | 2.258568 | 5.938690 |
18 | EPISCO | 18.900000 | 42.946667 | 40.900000 | 7.790000e+01 | 29.600000 | 43.600000 | 3 | 3 | 1.000000 | 137 | 177 | 0.774011 | 0.641799 | 0.035208 |
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])
1
##################################
# Identifying the numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
display(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)].sort_values(by=['First.Second.Mode.Ratio'], ascending=False))
Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | PM2EXP | 0.274092 | 91.940595 | 100.0 | 100.0 | 100.0 | 100.0 | 106 | 2 | 53.0 | 61 | 177 | 0.344633 | -3.141557 | 9.032386 |
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])
5
##################################
# Identifying the numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
display(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))].sort_values(by=['Skewness'], ascending=False))
Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | POPDEN | 2.115134 | 200.886765 | 77.983133 | 7.918951e+03 | 3.335312 | 19.331586 | 1 | 1 | 1.000000 | 174 | 177 | 0.983051 | 10.267750 | 119.995256 |
9 | GHGEMI | 179.725150 | 259582.709895 | 41009.275980 | 1.294287e+07 | 571903.119900 | 80158.025830 | 1 | 1 | 1.000000 | 170 | 177 | 0.960452 | 9.496120 | 101.637308 |
2 | PATRES | 1.000000 | 20607.388889 | 244.500000 | 1.344817e+06 | 6.000000 | 2.000000 | 4 | 3 | 1.333333 | 97 | 177 | 0.548023 | 9.284436 | 91.187178 |
11 | METEMI | 11.596147 | 47876.133575 | 11118.976025 | 1.186285e+06 | 131484.763200 | 32241.937000 | 1 | 1 | 1.000000 | 170 | 177 | 0.960452 | 5.801014 | 38.661386 |
14 | PM2EXP | 0.274092 | 91.940595 | 100.000000 | 1.000000e+02 | 100.000000 | 100.000000 | 106 | 2 | 53.000000 | 61 | 177 | 0.344633 | -3.141557 | 9.032386 |
##################################
# Formulating the dataset
# with object column only
##################################
cancer_rate_object = cancer_rate.select_dtypes(include='object')
##################################
# Gathering the variable names for the object column
##################################
object_variable_name_list = cancer_rate_object.columns
##################################
# Gathering the first mode values for the object column
##################################
object_first_mode_list = [cancer_rate[x].value_counts().index.tolist()[0] for x in cancer_rate_object]
##################################
# Gathering the second mode values for each object column
##################################
object_second_mode_list = [cancer_rate[x].value_counts().index.tolist()[1] for x in cancer_rate_object]
##################################
# Gathering the count of first mode values for each object column
##################################
object_first_mode_count_list = [cancer_rate_object[x].isin([cancer_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_rate_object]
##################################
# Gathering the count of second mode values for each object column
##################################
object_second_mode_count_list = [cancer_rate_object[x].isin([cancer_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_rate_object]
##################################
# Gathering the first mode to second mode ratio for each object column
##################################
object_first_second_mode_ratio_list = map(truediv, object_first_mode_count_list, object_second_mode_count_list)
##################################
# Gathering the count of unique values for each object column
##################################
object_unique_count_list = cancer_rate_object.nunique(dropna=True)
##################################
# Gathering the number of observations for each object column
##################################
object_row_count_list = list([len(cancer_rate_object)] * len(cancer_rate_object.columns))
##################################
# Gathering the unique to count ratio for each object column
##################################
object_unique_count_ratio_list = map(truediv, object_unique_count_list, object_row_count_list)
object_column_quality_summary = pd.DataFrame(zip(object_variable_name_list,
object_first_mode_list,
object_second_mode_list,
object_first_mode_count_list,
object_second_mode_count_list,
object_first_second_mode_ratio_list,
object_unique_count_list,
object_row_count_list,
object_unique_count_ratio_list),
columns=['Object.Column.Name',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio'])
display(object_column_quality_summary)
Object.Column.Name | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | |
---|---|---|---|---|---|---|---|---|---|
0 | COUNTRY | Australia | New Zealand | 1 | 1 | 1.0 | 177 | 177 | 1.0 |
##################################
# Counting the number of object columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(object_column_quality_summary[(object_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Counting the number of object columns
# with Unique.Count.Ratio > 10.00
##################################
len(object_column_quality_summary[(object_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Formulating the dataset
# with categorical columns only
##################################
cancer_rate_categorical = cancer_rate.select_dtypes(include='category')
##################################
# Gathering the variable names for the categorical column
##################################
categorical_variable_name_list = cancer_rate_categorical.columns
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [cancer_rate[x].value_counts().index.tolist()[0] for x in cancer_rate_categorical]
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [cancer_rate[x].value_counts().index.tolist()[1] for x in cancer_rate_categorical]
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [cancer_rate_categorical[x].isin([cancer_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_rate_categorical]
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [cancer_rate_categorical[x].isin([cancer_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_rate_categorical]
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = cancer_rate_categorical.nunique(dropna=True)
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(cancer_rate_categorical)] * len(cancer_rate_categorical.columns))
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
categorical_first_mode_list,
categorical_second_mode_list,
categorical_first_mode_count_list,
categorical_second_mode_count_list,
categorical_first_second_mode_ratio_list,
categorical_unique_count_list,
categorical_row_count_list,
categorical_unique_count_ratio_list),
columns=['Categorical.Column.Name',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio'])
display(categorical_column_quality_summary)
Categorical.Column.Name | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | |
---|---|---|---|---|---|---|---|---|---|
0 | CANRAT | Low | High | 132 | 45 | 2.933333 | 2 | 177 | 0.011299 |
1 | HDICAT | VH | H | 59 | 39 | 1.512821 | 4 | 177 | 0.022599 |
##################################
# Counting the number of categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Counting the number of categorical columns
# with Unique.Count.Ratio > 10.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
0
1.4. Data Preprocessing ¶
1.4.1 Data Cleaning ¶
- Subsets of rows and columns with high rates of missing data were removed from the dataset:
- 4 variables with Fill.Rate<0.9 were excluded from subsequent analysis.
- RNDGDP: Null.Count = 103, Fill.Rate = 0.418
- PATRES: Null.Count = 69, Fill.Rate = 0.610
- ENRTER: Null.Count = 61, Fill.Rate = 0.655
- RELOUT: Null.Count = 24, Fill.Rate = 0.864
- 14 rows with Missing.Rate>0.2 were excluded from subsequent analysis.
- COUNTRY=Guadeloupe: Missing.Rate= 0.909
- COUNTRY=Martinique: Missing.Rate= 0.909
- COUNTRY=French Guiana: Missing.Rate= 0.909
- COUNTRY=New Caledonia: Missing.Rate= 0.500
- COUNTRY=French Polynesia: Missing.Rate= 0.500
- COUNTRY=Guam: Missing.Rate= 0.500
- COUNTRY=Puerto Rico: Missing.Rate= 0.409
- COUNTRY=North Korea: Missing.Rate= 0.273
- COUNTRY=Somalia: Missing.Rate= 0.273
- COUNTRY=South Sudan: Missing.Rate= 0.273
- COUNTRY=Venezuela: Missing.Rate= 0.227
- COUNTRY=Libya: Missing.Rate= 0.227
- COUNTRY=Eritrea: Missing.Rate= 0.227
- COUNTRY=Yemen: Missing.Rate= 0.227
- No variables were removed due to zero or near-zero variance.
- The cleaned dataset is comprised of:
- 163 rows (observations)
- 18 columns (variables)
- 1/18 metadata (object)
- COUNTRY
- 1/18 target (categorical)
- CANRAT
- 15/18 predictor (numeric)
- GDPPER
- URBPOP
- POPGRO
- LIFEXP
- TUBINC
- DTHCMD
- AGRLND
- GHGEMI
- METEMI
- FORARE
- CO2EMI
- PM2EXP
- POPDEN
- GDPCAP
- EPISCO
- 1/18 predictor (categorical)
- HDICAT
##################################
# Performing a general exploration of the original dataset
##################################
print('Dataset Dimensions: ')
display(cancer_rate.shape)
Dataset Dimensions:
(177, 22)
##################################
# Filtering out the rows with
# with Missing.Rate > 0.20
##################################
cancer_rate_filtered_row = cancer_rate.drop(cancer_rate[cancer_rate.COUNTRY.isin(row_high_missing_rate['Row.Name'].values.tolist())].index)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_rate_filtered_row.shape)
Dataset Dimensions:
(163, 22)
##################################
# Filtering out the columns with
# with Fill.Rate < 0.90
##################################
cancer_rate_filtered_row_column = cancer_rate_filtered_row.drop(column_low_fill_rate['Column.Name'].values.tolist(), axis=1)
##################################
# Formulating a new dataset object
# for the cleaned data
##################################
cancer_rate_cleaned = cancer_rate_filtered_row_column
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_rate_cleaned.shape)
Dataset Dimensions:
(163, 18)
1.4.2 Missing Data Imputation ¶
Iterative Imputer is based on the Multivariate Imputation by Chained Equations (MICE) algorithm - an imputation method based on fully conditional specification, where each incomplete variable is imputed by a separate model. As a sequential regression imputation technique, the algorithm imputes an incomplete column (target column) by generating plausible synthetic values given the other columns in the data. Each incomplete column acts in turn as a target column and has its own specific set of predictors. For predictors that are incomplete themselves, the most recently generated imputations are used to complete them prior to imputing the target column.
Linear Regression explores the linear relationship between a scalar response and one or more covariates by modeling the conditional mean of the dependent variable as an affine function of the independent variables. The relationship includes a disturbance term representing an unobserved random variable that adds noise. The model is typically fitted to the data using the least squares method, which estimates the coefficients by minimizing the sum of squared residuals. The linear equation assigns one scale factor, represented by a coefficient, to each covariate plus an additional coefficient called the intercept or bias coefficient, which gives the fitted line an additional degree of freedom allowing it to shift up and down on a two-dimensional plot.
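As a minimal numeric illustration of the least squares estimation described above (separate from the scikit-learn LinearRegression estimator plugged into the imputer below), the sketch recovers the coefficients and intercept of an assumed affine relationship from synthetic data; the true coefficient values and noise level are arbitrary choices made for this example only.
##################################
# Illustrative sketch (synthetic data):
# least squares estimation of a linear model
##################################
import numpy as np

# Synthetic data: response as an affine function of two covariates plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# Least squares: minimize the sum of squared residuals ||y - Xb||^2
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])  # prepend intercept column
coefficients, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
print(coefficients)  # approximately [3.0, 2.0, -1.5]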
- Missing data for numeric variables were imputed using the iterative imputer algorithm with a linear regression estimator.
- GDPPER: Null.Count = 1
- FORARE: Null.Count = 1
- PM2EXP: Null.Count = 5
- Missing data for categorical variables were imputed using the most frequent value.
- HDICAT: Null.Count = 1
##################################
# Formulating the summary
# for all cleaned columns
##################################
cleaned_column_quality_summary = pd.DataFrame(zip(list(cancer_rate_cleaned.columns),
list(cancer_rate_cleaned.dtypes),
list([len(cancer_rate_cleaned)] * len(cancer_rate_cleaned.columns)),
list(cancer_rate_cleaned.count()),
list(cancer_rate_cleaned.isna().sum(axis=0))),
columns=['Column.Name',
'Column.Type',
'Row.Count',
'Non.Null.Count',
'Null.Count'])
display(cleaned_column_quality_summary)
Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | |
---|---|---|---|---|---|
0 | COUNTRY | object | 163 | 163 | 0 |
1 | CANRAT | category | 163 | 163 | 0 |
2 | GDPPER | float64 | 163 | 162 | 1 |
3 | URBPOP | float64 | 163 | 163 | 0 |
4 | POPGRO | float64 | 163 | 163 | 0 |
5 | LIFEXP | float64 | 163 | 163 | 0 |
6 | TUBINC | float64 | 163 | 163 | 0 |
7 | DTHCMD | float64 | 163 | 163 | 0 |
8 | AGRLND | float64 | 163 | 163 | 0 |
9 | GHGEMI | float64 | 163 | 163 | 0 |
10 | METEMI | float64 | 163 | 163 | 0 |
11 | FORARE | float64 | 163 | 162 | 1 |
12 | CO2EMI | float64 | 163 | 163 | 0 |
13 | PM2EXP | float64 | 163 | 158 | 5 |
14 | POPDEN | float64 | 163 | 163 | 0 |
15 | GDPCAP | float64 | 163 | 163 | 0 |
16 | HDICAT | category | 163 | 162 | 1 |
17 | EPISCO | float64 | 163 | 163 | 0 |
##################################
# Formulating the cleaned dataset
# with categorical columns only
##################################
cancer_rate_cleaned_categorical = cancer_rate_cleaned.select_dtypes(include='category')
##################################
# Formulating the cleaned dataset
# with numeric columns only
##################################
cancer_rate_cleaned_numeric = cancer_rate_cleaned.select_dtypes(include='number')
##################################
# Taking a snapshot of the cleaned dataset
##################################
cancer_rate_cleaned_numeric.head()
GDPPER | URBPOP | POPGRO | LIFEXP | TUBINC | DTHCMD | AGRLND | GHGEMI | METEMI | FORARE | CO2EMI | PM2EXP | POPDEN | GDPCAP | EPISCO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 98380.63601 | 86.241 | 1.235701 | 83.200000 | 7.2 | 4.941054 | 46.252480 | 5.719031e+05 | 131484.763200 | 17.421315 | 14.772658 | 24.893584 | 3.335312 | 51722.06900 | 60.1 |
1 | 77541.76438 | 86.699 | 2.204789 | 82.256098 | 7.2 | 4.354730 | 38.562911 | 8.015803e+04 | 32241.937000 | 37.570126 | 6.160799 | NaN | 19.331586 | 41760.59478 | 56.7 |
2 | 198405.87500 | 63.653 | 1.029111 | 82.556098 | 5.3 | 5.684596 | 65.495718 | 5.949773e+04 | 15252.824630 | 11.351720 | 6.768228 | 0.274092 | 72.367281 | 85420.19086 | 57.4 |
3 | 130941.63690 | 82.664 | 0.964348 | 76.980488 | 2.3 | 5.302060 | 44.363367 | 5.505181e+06 | 748241.402900 | 33.866926 | 13.032828 | 3.343170 | 36.240985 | 63528.63430 | 51.1 |
4 | 113300.60110 | 88.116 | 0.291641 | 81.602439 | 4.1 | 6.826140 | 65.499675 | 4.113555e+04 | 7778.773921 | 15.711000 | 4.691237 | 56.914456 | 145.785100 | 60915.42440 | 77.9 |
##################################
# Defining the estimator to be used
# at each step of the round-robin imputation
##################################
lr = LinearRegression()
##################################
# Defining the parameter of the
# iterative imputer which will estimate
# the columns with missing values
# as a function of the other columns
# in a round-robin fashion
##################################
iterative_imputer = IterativeImputer(
estimator = lr,
max_iter = 10,
tol = 1e-10,
imputation_order = 'ascending',
random_state=88888888
)
##################################
# Implementing the iterative imputer
##################################
cancer_rate_imputed_numeric_array = iterative_imputer.fit_transform(cancer_rate_cleaned_numeric)
##################################
# Transforming the imputed data
# from an array to a dataframe
##################################
cancer_rate_imputed_numeric = pd.DataFrame(cancer_rate_imputed_numeric_array,
columns = cancer_rate_cleaned_numeric.columns)
##################################
# Taking a snapshot of the imputed dataset
##################################
cancer_rate_imputed_numeric.head()
GDPPER | URBPOP | POPGRO | LIFEXP | TUBINC | DTHCMD | AGRLND | GHGEMI | METEMI | FORARE | CO2EMI | PM2EXP | POPDEN | GDPCAP | EPISCO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 98380.63601 | 86.241 | 1.235701 | 83.200000 | 7.2 | 4.941054 | 46.252480 | 5.719031e+05 | 131484.763200 | 17.421315 | 14.772658 | 24.893584 | 3.335312 | 51722.06900 | 60.1 |
1 | 77541.76438 | 86.699 | 2.204789 | 82.256098 | 7.2 | 4.354730 | 38.562911 | 8.015803e+04 | 32241.937000 | 37.570126 | 6.160799 | 65.867296 | 19.331586 | 41760.59478 | 56.7 |
2 | 198405.87500 | 63.653 | 1.029111 | 82.556098 | 5.3 | 5.684596 | 65.495718 | 5.949773e+04 | 15252.824630 | 11.351720 | 6.768228 | 0.274092 | 72.367281 | 85420.19086 | 57.4 |
3 | 130941.63690 | 82.664 | 0.964348 | 76.980488 | 2.3 | 5.302060 | 44.363367 | 5.505181e+06 | 748241.402900 | 33.866926 | 13.032828 | 3.343170 | 36.240985 | 63528.63430 | 51.1 |
4 | 113300.60110 | 88.116 | 0.291641 | 81.602439 | 4.1 | 6.826140 | 65.499675 | 4.113555e+04 | 7778.773921 | 15.711000 | 4.691237 | 56.914456 | 145.785100 | 60915.42440 | 77.9 |
##################################
# Formulating the cleaned dataset
# with categorical columns only
##################################
cancer_rate_cleaned_categorical = cancer_rate_cleaned.select_dtypes(include='category')
##################################
# Imputing the missing data
# for categorical columns with
# the most frequent category
##################################
cancer_rate_cleaned_categorical['HDICAT'] = cancer_rate_cleaned_categorical['HDICAT'].fillna(cancer_rate_cleaned_categorical['HDICAT'].mode()[0])
cancer_rate_imputed_categorical = cancer_rate_cleaned_categorical.reset_index(drop=True)
##################################
# Formulating the imputed dataset
##################################
cancer_rate_imputed = pd.concat([cancer_rate_imputed_numeric,cancer_rate_imputed_categorical], axis=1, join='inner')
##################################
# Gathering the data types for each column
##################################
data_type_list = list(cancer_rate_imputed.dtypes)
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(cancer_rate_imputed.columns)
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(cancer_rate_imputed)] * len(cancer_rate_imputed.columns))
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(cancer_rate_imputed.isna().sum(axis=0))
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(cancer_rate_imputed.count())
##################################
# Gathering the missing data percentage for each column
##################################
fill_rate_list = map(truediv, non_null_count_list, row_count_list)
##################################
# Formulating the summary
# for all imputed columns
##################################
imputed_column_quality_summary = pd.DataFrame(zip(variable_name_list,
data_type_list,
row_count_list,
non_null_count_list,
null_count_list,
fill_rate_list),
columns=['Column.Name',
'Column.Type',
'Row.Count',
'Non.Null.Count',
'Null.Count',
'Fill.Rate'])
display(imputed_column_quality_summary)
Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate | |
---|---|---|---|---|---|---|
0 | GDPPER | float64 | 163 | 163 | 0 | 1.0 |
1 | URBPOP | float64 | 163 | 163 | 0 | 1.0 |
2 | POPGRO | float64 | 163 | 163 | 0 | 1.0 |
3 | LIFEXP | float64 | 163 | 163 | 0 | 1.0 |
4 | TUBINC | float64 | 163 | 163 | 0 | 1.0 |
5 | DTHCMD | float64 | 163 | 163 | 0 | 1.0 |
6 | AGRLND | float64 | 163 | 163 | 0 | 1.0 |
7 | GHGEMI | float64 | 163 | 163 | 0 | 1.0 |
8 | METEMI | float64 | 163 | 163 | 0 | 1.0 |
9 | FORARE | float64 | 163 | 163 | 0 | 1.0 |
10 | CO2EMI | float64 | 163 | 163 | 0 | 1.0 |
11 | PM2EXP | float64 | 163 | 163 | 0 | 1.0 |
12 | POPDEN | float64 | 163 | 163 | 0 | 1.0 |
13 | GDPCAP | float64 | 163 | 163 | 0 | 1.0 |
14 | EPISCO | float64 | 163 | 163 | 0 | 1.0 |
15 | CANRAT | category | 163 | 163 | 0 | 1.0 |
16 | HDICAT | category | 163 | 163 | 0 | 1.0 |
1.4.3 Outlier Detection ¶
- High number of outliers observed for 5 numeric variables with Outlier.Ratio>0.10 and marginal to high Skewness.
- PM2EXP: Outlier.Count = 37, Outlier.Ratio = 0.226, Skewness=-3.061
- GHGEMI: Outlier.Count = 27, Outlier.Ratio = 0.165, Skewness=+9.299
- GDPCAP: Outlier.Count = 22, Outlier.Ratio = 0.134, Skewness=+2.311
- POPDEN: Outlier.Count = 20, Outlier.Ratio = 0.122, Skewness=+9.972
- METEMI: Outlier.Count = 20, Outlier.Ratio = 0.122, Skewness=+5.688
- Minimal number of outliers observed for 5 numeric variables with Outlier.Ratio<0.10 and normal Skewness.
- TUBINC: Outlier.Count = 12, Outlier.Ratio = 0.073, Skewness=+1.747
- CO2EMI: Outlier.Count = 11, Outlier.Ratio = 0.067, Skewness=+2.693
- GDPPER: Outlier.Count = 3, Outlier.Ratio = 0.018, Skewness=+1.554
- EPISCO: Outlier.Count = 3, Outlier.Ratio = 0.018, Skewness=+0.635
- CANRAT: Outlier.Count = 2, Outlier.Ratio = 0.012, Skewness=+0.910
##################################
# Formulating the imputed dataset
# with numeric columns only
##################################
cancer_rate_imputed_numeric = cancer_rate_imputed.select_dtypes(include='number')
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = list(cancer_rate_imputed_numeric.columns)
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = cancer_rate_imputed_numeric.skew()
##################################
# Computing the interquartile range
# for all columns
##################################
cancer_rate_imputed_numeric_q1 = cancer_rate_imputed_numeric.quantile(0.25)
cancer_rate_imputed_numeric_q3 = cancer_rate_imputed_numeric.quantile(0.75)
cancer_rate_imputed_numeric_iqr = cancer_rate_imputed_numeric_q3 - cancer_rate_imputed_numeric_q1
##################################
# Gathering the outlier count for each numeric column
# based on the interquartile range criterion
##################################
numeric_outlier_count_list = ((cancer_rate_imputed_numeric < (cancer_rate_imputed_numeric_q1 - 1.5 * cancer_rate_imputed_numeric_iqr)) | (cancer_rate_imputed_numeric > (cancer_rate_imputed_numeric_q3 + 1.5 * cancer_rate_imputed_numeric_iqr))).sum()
##################################
# Gathering the number of observations for each column
##################################
numeric_row_count_list = list([len(cancer_rate_imputed_numeric)] * len(cancer_rate_imputed_numeric.columns))
##################################
# Gathering the outlier to count ratio for each numeric column
##################################
numeric_outlier_ratio_list = map(truediv, numeric_outlier_count_list, numeric_row_count_list)
##################################
# Formulating the outlier summary
# for all numeric columns
##################################
numeric_column_outlier_summary = pd.DataFrame(zip(numeric_variable_name_list,
numeric_skewness_list,
numeric_outlier_count_list,
numeric_row_count_list,
numeric_outlier_ratio_list),
columns=['Numeric.Column.Name',
'Skewness',
'Outlier.Count',
'Row.Count',
'Outlier.Ratio'])
display(numeric_column_outlier_summary)
Numeric.Column.Name | Skewness | Outlier.Count | Row.Count | Outlier.Ratio | |
---|---|---|---|---|---|
0 | GDPPER | 1.554457 | 3 | 163 | 0.018405 |
1 | URBPOP | -0.212327 | 0 | 163 | 0.000000 |
2 | POPGRO | -0.181666 | 0 | 163 | 0.000000 |
3 | LIFEXP | -0.329704 | 0 | 163 | 0.000000 |
4 | TUBINC | 1.747962 | 12 | 163 | 0.073620 |
5 | DTHCMD | 0.930709 | 0 | 163 | 0.000000 |
6 | AGRLND | 0.035315 | 0 | 163 | 0.000000 |
7 | GHGEMI | 9.299960 | 27 | 163 | 0.165644 |
8 | METEMI | 5.688689 | 20 | 163 | 0.122699 |
9 | FORARE | 0.563015 | 0 | 163 | 0.000000 |
10 | CO2EMI | 2.693585 | 11 | 163 | 0.067485 |
11 | PM2EXP | -3.088403 | 37 | 163 | 0.226994 |
12 | POPDEN | 9.972806 | 20 | 163 | 0.122699 |
13 | GDPCAP | 2.311079 | 22 | 163 | 0.134969 |
14 | EPISCO | 0.635994 | 3 | 163 | 0.018405 |
##################################
# Formulating the individual boxplots
# for all numeric columns
##################################
for column in cancer_rate_imputed_numeric:
plt.figure(figsize=(17,1))
sns.boxplot(data=cancer_rate_imputed_numeric, x=column)
1.4.4 Collinearity ¶
Pearson’s Correlation Coefficient is a parametric measure of the linear correlation between a pair of features, calculated as the ratio between their covariance and the product of their standard deviations. High absolute correlation values indicate strong pairwise linear association among the numeric variables and flag potential collinearity between predictors.
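To make the definition concrete, the coefficient for a single pair of columns can be reproduced directly as the ratio of their covariance to the product of their standard deviations and cross-checked against scipy.stats.pearsonr. The short sketch below is illustrative only; it assumes the cancer_rate_imputed_numeric dataframe defined in the preceding cells and uses the GDPPER and GDPCAP pair purely as an example.
##################################
# Illustrative sketch only:
# reproducing Pearson's correlation
# coefficient for one column pair as
# covariance over the product of
# standard deviations, cross-checked
# against scipy.stats.pearsonr
##################################
import numpy as np
from scipy import stats
x = cancer_rate_imputed_numeric['GDPPER'].to_numpy()
y = cancer_rate_imputed_numeric['GDPCAP'].to_numpy()
# Ratio of the sample covariance to the product of the sample standard deviations
manual_r = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
# Library computation with the associated p-value
library_r, p_value = stats.pearsonr(x, y)
print(f'Manual r: {manual_r:.6f} | Library r: {library_r:.6f} | p-value: {p_value:.3e}')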
- The majority of the numeric variable pairs reported moderate to high correlations which were statistically significant.
- Among pairwise combinations of numeric variables, high Pearson.Correlation.Coefficient values were noted for:
- GDPPER and GDPCAP: Pearson.Correlation.Coefficient = +0.921
- GHGEMI and METEMI: Pearson.Correlation.Coefficient = +0.905
- Among the highly correlated pairs, variables with the lowest correlation against the target variable were removed.
- GDPPER: Pearson.Correlation.Coefficient = +0.690
- METEMI: Pearson.Correlation.Coefficient = +0.062
- The cleaned dataset is comprised of:
- 163 rows (observations)
- 16 columns (variables)
- 1/16 metadata (object)
- COUNTRY
- 1/16 target (categorical)
- CANRAT
- 13/16 predictor (numeric)
- URBPOP
- POPGRO
- LIFEXP
- TUBINC
- DTHCMD
- AGRLND
- GHGEMI
- FORARE
- CO2EMI
- PM2EXP
- POPDEN
- GDPCAP
- EPISCO
- 1/16 predictor (categorical)
- HDICAT
##################################
# Formulating a function
# to plot the correlation matrix
# for all pairwise combinations
# of numeric columns
##################################
def plot_correlation_matrix(corr, mask=None):
f, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr,
ax=ax,
mask=mask,
annot=True,
vmin=-1,
vmax=1,
center=0,
cmap='coolwarm',
linewidths=1,
linecolor='gray',
cbar_kws={'orientation': 'horizontal'})
##################################
# Computing the correlation coefficients
# and correlation p-values
# among pairs of numeric columns
##################################
cancer_rate_imputed_numeric_correlation_pairs = {}
cancer_rate_imputed_numeric_columns = cancer_rate_imputed_numeric.columns.tolist()
for numeric_column_a, numeric_column_b in itertools.combinations(cancer_rate_imputed_numeric_columns, 2):
cancer_rate_imputed_numeric_correlation_pairs[numeric_column_a + '_' + numeric_column_b] = stats.pearsonr(
cancer_rate_imputed_numeric.loc[:, numeric_column_a],
cancer_rate_imputed_numeric.loc[:, numeric_column_b])
##################################
# Formulating the pairwise correlation summary
# for all numeric columns
##################################
cancer_rate_imputed_numeric_summary = pd.DataFrame.from_dict(cancer_rate_imputed_numeric_correlation_pairs, orient='index')
cancer_rate_imputed_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_rate_imputed_numeric_summary.sort_values(by=['Pearson.Correlation.Coefficient'], ascending=False).head(20))
Pearson.Correlation.Coefficient | Correlation.PValue | |
---|---|---|
GDPPER_GDPCAP | 0.921010 | 8.158179e-68 |
GHGEMI_METEMI | 0.905121 | 1.087643e-61 |
POPGRO_DTHCMD | 0.759470 | 7.124695e-32 |
GDPPER_LIFEXP | 0.755787 | 2.055178e-31 |
GDPCAP_EPISCO | 0.696707 | 5.312642e-25 |
LIFEXP_GDPCAP | 0.683834 | 8.321371e-24 |
GDPPER_EPISCO | 0.680812 | 1.555304e-23 |
GDPPER_URBPOP | 0.666394 | 2.781623e-22 |
GDPPER_CO2EMI | 0.654958 | 2.450029e-21 |
TUBINC_DTHCMD | 0.643615 | 1.936081e-20 |
URBPOP_LIFEXP | 0.623997 | 5.669778e-19 |
LIFEXP_EPISCO | 0.620271 | 1.048393e-18 |
URBPOP_GDPCAP | 0.559181 | 8.624533e-15 |
CO2EMI_GDPCAP | 0.550221 | 2.782997e-14 |
URBPOP_CO2EMI | 0.550046 | 2.846393e-14 |
LIFEXP_CO2EMI | 0.531305 | 2.951829e-13 |
URBPOP_EPISCO | 0.510131 | 3.507463e-12 |
POPGRO_TUBINC | 0.442339 | 3.384403e-09 |
DTHCMD_PM2EXP | 0.283199 | 2.491837e-04 |
CO2EMI_EPISCO | 0.282734 | 2.553620e-04 |
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric columns
##################################
cancer_rate_imputed_numeric_correlation = cancer_rate_imputed_numeric.corr()
mask = np.triu(cancer_rate_imputed_numeric_correlation)
plot_correlation_matrix(cancer_rate_imputed_numeric_correlation,mask)
plt.show()
##################################
# Formulating a function
# to plot the correlation matrix
# for all pairwise combinations
# of numeric columns
# with significant p-values only
##################################
def correlation_significance(df=None):
p_matrix = np.zeros(shape=(df.shape[1],df.shape[1]))
for col in df.columns:
for col2 in df.drop(col,axis=1).columns:
_ , p = stats.pearsonr(df[col],df[col2])
p_matrix[df.columns.to_list().index(col),df.columns.to_list().index(col2)] = p
return p_matrix
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric columns
# with significant p-values only
##################################
cancer_rate_imputed_numeric_correlation_p_values = correlation_significance(cancer_rate_imputed_numeric)
mask = np.invert(np.tril(cancer_rate_imputed_numeric_correlation_p_values<0.05))
plot_correlation_matrix(cancer_rate_imputed_numeric_correlation,mask)
##################################
# Filtering out one among the
# highly correlated variable pairs with
# lesser Pearson.Correlation.Coefficient
# when compared to the target variable
##################################
cancer_rate_imputed_numeric.drop(['GDPPER','METEMI'], inplace=True, axis=1)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_rate_imputed_numeric.shape)
Dataset Dimensions:
(163, 13)
1.4.5 Shape Transformation ¶
Yeo-Johnson Transformation applies a new family of distributions that can be used without restrictions, extending many of the good properties of the Box-Cox power family. Similar to the Box-Cox transformation, the method also estimates the optimal value of lambda but has the ability to transform both positive and negative values by inflating low variance data and deflating high variance data to create a more uniform data set. While there are no restrictions in terms of the applicable values, the interpretability of the transformed values is more diminished as compared to the other methods.
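To make the mapping explicit, the piecewise Yeo-Johnson formula for a single value and a given lambda can be written out in a few lines. The sketch below is illustrative only and is not the estimation procedure used by PowerTransformer in the succeeding cells, which selects lambda by maximum likelihood; the function name and sample values are hypothetical.
##################################
# Illustrative sketch only:
# the piecewise Yeo-Johnson mapping
# for a single value y and a given
# lambda (PowerTransformer below
# estimates lambda by maximum likelihood)
##################################
import numpy as np
def yeo_johnson_single(y, lam):
    # Non-negative values follow a shifted Box-Cox-style power transform
    if y >= 0:
        return np.log1p(y) if np.isclose(lam, 0) else ((y + 1) ** lam - 1) / lam
    # Negative values use the mirrored transform with power (2 - lambda)
    return -np.log1p(-y) if np.isclose(lam, 2) else -(((-y + 1) ** (2 - lam) - 1) / (2 - lam))
# Illustrative evaluation on a positive and a negative value
print(yeo_johnson_single(3.5, 0.5), yeo_johnson_single(-2.0, 0.5))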
- A Yeo-Johnson transformation was applied to all numeric variables to improve distributional shape.
- Most variables achieved symmetrical distributions with minimal outliers after transformation.
- One variable which remained skewed even after applying shape transformation was removed.
- PM2EXP
- The transformed dataset is comprised of:
- 163 rows (observations)
- 15 columns (variables)
- 1/15 metadata (object)
- COUNTRY
- 1/15 target (categorical)
- CANRAT
- 12/15 predictor (numeric)
- URBPOP
- POPGRO
- LIFEXP
- TUBINC
- DTHCMD
- AGRLND
- GHGEMI
- FORARE
- CO2EMI
- POPDEN
- GDPCAP
- EPISCO
- 1/15 predictor (categorical)
- HDICAT
##################################
# Conducting a Yeo-Johnson Transformation
# to address the distributional
# shape of the variables
##################################
yeo_johnson_transformer = PowerTransformer(method='yeo-johnson',
standardize=False)
cancer_rate_imputed_numeric_array = yeo_johnson_transformer.fit_transform(cancer_rate_imputed_numeric)
##################################
# Formulating a new dataset object
# for the transformed data
##################################
cancer_rate_transformed_numeric = pd.DataFrame(cancer_rate_imputed_numeric_array,
columns=cancer_rate_imputed_numeric.columns)
##################################
# Formulating the individual boxplots
# for all transformed numeric columns
##################################
for column in cancer_rate_transformed_numeric:
plt.figure(figsize=(17,1))
sns.boxplot(data=cancer_rate_transformed_numeric, x=column)
##################################
# Filtering out the column
# which remained skewed even
# after applying shape transformation
##################################
cancer_rate_transformed_numeric.drop(['PM2EXP'], inplace=True, axis=1)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_rate_transformed_numeric.shape)
Dataset Dimensions:
(163, 12)
1.4.6 Centering and Scaling ¶
- All numeric variables were transformed using the standardization method to achieve a comparable scale between values.
- The scaled dataset is comprised of:
- 163 rows (observations)
- 15 columns (variables)
- 1/15 metadata (object)
- COUNTRY
- 1/15 target (categorical)
- CANRAT
- 12/15 predictor (numeric)
- URBPOP
- POPGRO
- LIFEXP
- TUBINC
- DTHCMD
- AGRLND
- GHGEMI
- FORARE
- CO2EMI
- POPDEN
- GDPCAP
- EPISCO
- 1/15 predictor (categorical)
- HDICAT
##################################
# Conducting standardization
# to transform the values of the
# variables into comparable scale
##################################
standardization_scaler = StandardScaler()
cancer_rate_transformed_numeric_array = standardization_scaler.fit_transform(cancer_rate_transformed_numeric)
##################################
# Formulating a new dataset object
# for the scaled data
##################################
cancer_rate_scaled_numeric = pd.DataFrame(cancer_rate_transformed_numeric_array,
columns=cancer_rate_transformed_numeric.columns)
##################################
# Formulating the individual boxplots
# for all transformed numeric columns
##################################
for column in cancer_rate_scaled_numeric:
plt.figure(figsize=(17,1))
sns.boxplot(data=cancer_rate_scaled_numeric, x=column)
1.4.7 Data Encoding ¶
- One-hot encoding was applied to the HDICAT variable, resulting in 4 additional columns in the dataset:
- HDICAT_L
- HDICAT_M
- HDICAT_H
- HDICAT_VH
##################################
# Formulating the categorical column
# for encoding transformation
##################################
cancer_rate_categorical_encoded = pd.DataFrame(cancer_rate_cleaned_categorical.loc[:, 'HDICAT'].to_list(),
columns=['HDICAT'])
##################################
# Applying a one-hot encoding transformation
# for the categorical column
##################################
cancer_rate_categorical_encoded = pd.get_dummies(cancer_rate_categorical_encoded, columns=['HDICAT'])
1.4.8 Preprocessed Data Description ¶
- The preprocessed dataset is comprised of:
- 163 rows (observations)
- 18 columns (variables)
- 1/18 metadata (object)
- COUNTRY
- 1/18 target (categorical)
- CANRAT
- 12/18 predictor (numeric)
- URBPOP
- POPGRO
- LIFEXP
- TUBINC
- DTHCMD
- AGRLND
- GHGEMI
- FORARE
- CO2EMI
- POPDEN
- GDPCAP
- EPISCO
- 4/18 predictor (categorical)
- HDICAT_L
- HDICAT_M
- HDICAT_H
- HDICAT_VH
##################################
# Consolidating both numeric columns
# and encoded categorical columns
##################################
cancer_rate_preprocessed = pd.concat([cancer_rate_scaled_numeric,cancer_rate_categorical_encoded], axis=1, join='inner')
##################################
# Performing a general exploration of the consolidated dataset
##################################
print('Dataset Dimensions: ')
display(cancer_rate_preprocessed.shape)
Dataset Dimensions:
(163, 16)
1.5. Data Exploration ¶
1.5.1 Exploratory Data Analysis ¶
- Bivariate analysis identified individual predictors with generally positive association to the target variable based on visual inspection.
- Higher values or higher proportions for the following predictors are associated with the CANRAT HIGH category:
- URBPOP
- LIFEXP
- CO2EMI
- GDPCAP
- EPISCO
- HDICAT_VH=1
- Decreasing values or smaller proportions for the following predictors are associated with the CANRAT LOW category:
- POPGRO
- TUBINC
- DTHCMD
- HDICAT_L=0
- HDICAT_M=0
- HDICAT_H=0
- Values for the following predictors are not associated with the CANRAT HIGH or LOW categories:
- AGRLND
- GHGEMI
- FORARE
- POPDEN
##################################
# Segregating the target
# and predictor variable lists
##################################
cancer_rate_preprocessed_target = cancer_rate_filtered_row['CANRAT'].to_frame()
cancer_rate_preprocessed_target.reset_index(inplace=True, drop=True)
cancer_rate_preprocessed_categorical = cancer_rate_preprocessed[cancer_rate_categorical_encoded.columns]
cancer_rate_preprocessed_categorical_combined = cancer_rate_preprocessed_categorical.join(cancer_rate_preprocessed_target)
cancer_rate_preprocessed = cancer_rate_preprocessed.drop(cancer_rate_categorical_encoded.columns, axis=1)
cancer_rate_preprocessed_predictors = cancer_rate_preprocessed.columns
cancer_rate_preprocessed_combined = cancer_rate_preprocessed.join(cancer_rate_preprocessed_target)
##################################
# Segregating the target
# and predictor variable names
##################################
y_variable = 'CANRAT'
x_variables = cancer_rate_preprocessed_predictors
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 6
num_cols = 2
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 30))
##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()
##################################
# Formulating the individual boxplots
# for all scaled numeric columns
##################################
for i, x_variable in enumerate(x_variables):
ax = axes[i]
ax.boxplot([group[x_variable] for name, group in cancer_rate_preprocessed_combined.groupby(y_variable, observed=True)])
ax.set_title(f'{y_variable} Versus {x_variable}')
ax.set_xlabel(y_variable)
ax.set_ylabel(x_variable)
ax.set_xticks(range(1, len(cancer_rate_preprocessed_combined[y_variable].unique()) + 1), ['Low', 'High'])
##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()
##################################
# Presenting the subplots
##################################
plt.show()
##################################
# Segregating the target
# and predictor variable names
##################################
y_variables = cancer_rate_preprocessed_categorical.columns
x_variable = 'CANRAT'
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 2
num_cols = 2
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 10))
##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()
##################################
# Formulating the individual stacked column plots
# for all categorical columns
##################################
for i, y_variable in enumerate(y_variables):
ax = axes[i]
category_counts = cancer_rate_preprocessed_categorical_combined.groupby([x_variable, y_variable], observed=True).size().unstack(fill_value=0)
category_proportions = category_counts.div(category_counts.sum(axis=1), axis=0)
category_proportions.plot(kind='bar', stacked=True, ax=ax)
ax.set_title(f'{x_variable} Versus {y_variable}')
ax.set_xlabel(x_variable)
ax.set_ylabel('Proportions')
##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()
##################################
# Presenting the subplots
##################################
plt.show()
1.5.2 Hypothesis Testing ¶
- The relationship between the numeric predictors and the CANRAT target variable was statistically evaluated using the following hypotheses:
- Null: Difference in the means between groups LOW and HIGH is equal to zero
- Alternative: Difference in the means between groups LOW and HIGH is not equal to zero
- There is sufficient evidence to conclude that the means of the numeric measurements obtained from the LOW and HIGH groups of the CANRAT target variable differ significantly for 9 of the 12 numeric predictors, given their high absolute t-test statistic values and reported p-values below the 0.05 significance level.
- GDPCAP: T.Test.Statistic=-11.937, T.Test.PValue=0.000
- EPISCO: T.Test.Statistic=-11.789, T.Test.PValue=0.000
- LIFEXP: T.Test.Statistic=-10.979, T.Test.PValue=0.000
- TUBINC: T.Test.Statistic=+9.609, T.Test.PValue=0.000
- DTHCMD: T.Test.Statistic=+8.376, T.Test.PValue=0.000
- CO2EMI: T.Test.Statistic=-7.031, T.Test.PValue=0.000
- URBPOP: T.Test.Statistic=-6.541, T.Test.PValue=0.000
- POPGRO: T.Test.Statistic=+4.905, T.Test.PValue=0.000
- GHGEMI: T.Test.Statistic=-2.243, T.Test.PValue=0.026
- The relationship between the categorical predictors and the CANRAT target variable was statistically evaluated using the following hypotheses:
- Null: The categorical predictor is independent of the categorical target variable
- Alternative: The categorical predictor is dependent on the categorical target variable
- There is sufficient evidence to conclude that the categories of all 4 categorical predictors are significantly associated with the LOW and HIGH groups of the CANRAT target variable, given their high chi-square statistic values and reported p-values below the 0.05 significance level.
- HDICAT_VH: ChiSquare.Test.Statistic=76.764, ChiSquare.Test.PValue=0.000
- HDICAT_M: ChiSquare.Test.Statistic=13.860, ChiSquare.Test.PValue=0.000
- HDICAT_L: ChiSquare.Test.Statistic=10.286, ChiSquare.Test.PValue=0.001
- HDICAT_H: ChiSquare.Test.Statistic=9.081, ChiSquare.Test.PValue=0.003
##################################
# Computing the t-test
# statistic and p-values
# between the target variable
# and numeric predictor columns
##################################
cancer_rate_preprocessed_numeric_ttest_target = {}
cancer_rate_preprocessed_numeric = cancer_rate_preprocessed_combined
cancer_rate_preprocessed_numeric_columns = cancer_rate_preprocessed_predictors
for numeric_column in cancer_rate_preprocessed_numeric_columns:
group_0 = cancer_rate_preprocessed_numeric[cancer_rate_preprocessed_numeric.loc[:,'CANRAT']=='Low']
group_1 = cancer_rate_preprocessed_numeric[cancer_rate_preprocessed_numeric.loc[:,'CANRAT']=='High']
cancer_rate_preprocessed_numeric_ttest_target['CANRAT_' + numeric_column] = stats.ttest_ind(
group_0[numeric_column],
group_1[numeric_column],
equal_var=True)
##################################
# Formulating the pairwise ttest summary
# between the target variable
# and numeric predictor columns
##################################
cancer_rate_preprocessed_numeric_summary = pd.DataFrame.from_dict(cancer_rate_preprocessed_numeric_ttest_target, orient='index')
cancer_rate_preprocessed_numeric_summary.columns = ['T.Test.Statistic', 'T.Test.PValue']
display(cancer_rate_preprocessed_numeric_summary.sort_values(by=['T.Test.PValue'], ascending=True).head(12))
T.Test.Statistic | T.Test.PValue | |
---|---|---|
CANRAT_GDPCAP | -11.936988 | 6.247937e-24 |
CANRAT_EPISCO | -11.788870 | 1.605980e-23 |
CANRAT_LIFEXP | -10.979098 | 2.754214e-21 |
CANRAT_TUBINC | 9.608760 | 1.463678e-17 |
CANRAT_DTHCMD | 8.375558 | 2.552108e-14 |
CANRAT_CO2EMI | -7.030702 | 5.537463e-11 |
CANRAT_URBPOP | -6.541001 | 7.734940e-10 |
CANRAT_POPGRO | 4.904817 | 2.269446e-06 |
CANRAT_GHGEMI | -2.243089 | 2.625563e-02 |
CANRAT_FORARE | -1.174143 | 2.420717e-01 |
CANRAT_POPDEN | -0.495221 | 6.211191e-01 |
CANRAT_AGRLND | -0.047628 | 9.620720e-01 |
##################################
# Computing the chisquare
# statistic and p-values
# between the target variable
# and categorical predictor columns
##################################
cancer_rate_preprocessed_categorical_chisquare_target = {}
cancer_rate_preprocessed_categorical = cancer_rate_preprocessed_categorical_combined
cancer_rate_preprocessed_categorical_columns = ['HDICAT_L','HDICAT_M','HDICAT_H','HDICAT_VH']
for categorical_column in cancer_rate_preprocessed_categorical_columns:
contingency_table = pd.crosstab(cancer_rate_preprocessed_categorical[categorical_column],
cancer_rate_preprocessed_categorical['CANRAT'])
cancer_rate_preprocessed_categorical_chisquare_target['CANRAT_' + categorical_column] = stats.chi2_contingency(
contingency_table)[0:2]
##################################
# Formulating the pairwise chisquare summary
# between the target variable
# and categorical predictor columns
##################################
cancer_rate_preprocessed_categorical_summary = pd.DataFrame.from_dict(cancer_rate_preprocessed_categorical_chisquare_target, orient='index')
cancer_rate_preprocessed_categorical_summary.columns = ['ChiSquare.Test.Statistic', 'ChiSquare.Test.PValue']
display(cancer_rate_preprocessed_categorical_summary.sort_values(by=['ChiSquare.Test.PValue'], ascending=True).head(4))
ChiSquare.Test.Statistic | ChiSquare.Test.PValue | |
---|---|---|
CANRAT_HDICAT_VH | 76.764134 | 1.926446e-18 |
CANRAT_HDICAT_M | 13.860367 | 1.969074e-04 |
CANRAT_HDICAT_L | 10.285575 | 1.340742e-03 |
CANRAT_HDICAT_H | 9.080788 | 2.583087e-03 |
1.6. Neural Network Classification Gradient and Weight Updates ¶
1.6.1 Premodelling Data Description ¶
- Among the predictor variables determined to have a statistically significant difference between the means of the numeric measurements obtained from the LOW and HIGH groups of the CANRAT target variable, only the 2 with the highest absolute t-test statistic values and reported p-values below the 0.05 significance level were retained.
- GDPCAP: T.Test.Statistic=-11.937, T.Test.PValue=0.000
- EPISCO: T.Test.Statistic=-11.789, T.Test.PValue=0.000
##################################
# Filtering certain numeric columns
# and encoded categorical columns
# after hypothesis testing
##################################
cancer_rate_premodelling = cancer_rate_preprocessed_combined.drop(['URBPOP', 'POPGRO', 'LIFEXP', 'TUBINC', 'DTHCMD', 'AGRLND', 'GHGEMI','FORARE', 'CO2EMI', 'POPDEN'], axis=1)
cancer_rate_premodelling.columns
Index(['GDPCAP', 'EPISCO', 'CANRAT'], dtype='object')
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_rate_premodelling.shape)
Dataset Dimensions:
(163, 3)
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(cancer_rate_premodelling.dtypes)
Column Names and Data Types:
GDPCAP      float64
EPISCO      float64
CANRAT     category
dtype: object
##################################
# Taking a snapshot of the dataset
##################################
cancer_rate_premodelling.head()
GDPCAP | EPISCO | CANRAT | |
---|---|---|---|
0 | 1.549766 | 1.306738 | High |
1 | 1.407752 | 1.102912 | High |
2 | 1.879374 | 1.145832 | High |
3 | 1.685426 | 0.739753 | High |
4 | 1.657777 | 2.218327 | High |
##################################
# Converting the dataframe to
# a numpy array
##################################
cancer_rate_premodelling_matrix = cancer_rate_premodelling.to_numpy()
##################################
# Formulating the scatterplot
# of the selected numeric predictors
# by categorical response classes
##################################
fig, ax = plt.subplots(figsize=(7, 7))
ax.plot(cancer_rate_premodelling_matrix[cancer_rate_premodelling_matrix[:,2]=='High', 0],
cancer_rate_premodelling_matrix[cancer_rate_premodelling_matrix[:,2]=='High', 1],
'o',
label='High',
color='darkslateblue')
ax.plot(cancer_rate_premodelling_matrix[cancer_rate_premodelling_matrix[:,2]=='Low', 0],
cancer_rate_premodelling_matrix[cancer_rate_premodelling_matrix[:,2]=='Low', 1],
'x',
label='Low',
color='chocolate')
ax.axes.set_ylabel('EPISCO')
ax.axes.set_xlabel('GDPCAP')
ax.set_xlim(-3,3)
ax.set_ylim(-3,3)
ax.set(title='CANRAT Class Distribution')
ax.legend(loc='upper left',title='CANRAT');
##################################
# Preparing the data and
# and converting to a suitable format
# as a neural network model input
##################################
matrix_x_values = cancer_rate_premodelling.iloc[:,0:2].to_numpy()
y_values = np.where(cancer_rate_premodelling['CANRAT'] == 'High', 1, 0)
1.6.2 Stochastic Gradient Descent Optimization ¶
Backpropagation and Weight Update, in the context of an artificial neural network, involve the process of iteratively adjusting the weights of the connections between neurons in the network to minimize the difference between the predicted and the actual target responses. Input data is fed into the neural network, and it propagates through the network layer by layer, starting from the input layer, through hidden layers, and ending at the output layer. At each neuron, the weighted sum of inputs is calculated, followed by the application of an activation function to produce the neuron's output. Once the forward pass is complete, the network's output is compared to the actual target output. The difference between the predicted output and the actual output is quantified using a loss function, which measures the discrepancy between the predicted and actual values. Common loss functions for classification tasks include cross-entropy loss. During the backward pass, the error is propagated backward through the network to compute the gradients of the loss function with respect to each weight in the network. This is achieved using the chain rule of calculus, which allows the error to be decomposed and distributed backward through the network. The gradients quantify how much a change in each weight would affect the overall error of the network. Once the gradients are computed, the weights are updated in the opposite direction of the gradient to minimize the error. This update is typically performed using an optimization algorithm such as gradient descent, which adjusts the weights in proportion to their gradients and a learning rate hyperparameter. The learning rate determines the size of the step taken in the direction opposite to the gradient. These steps are repeated for multiple iterations (epochs) over the training data. As the training progresses, the weights are adjusted iteratively to minimize the error, leading to a neural network model that accurately classifies input data.
Optimization Algorithms, in the context of neural network classification, are methods used to adjust the parameters (weights and biases) of a neural network during the training process in order to minimize a predefined loss function. The primary goal of these algorithms is to optimize the performance of the neural network by iteratively updating its parameters based on the feedback provided by the training data. Optimization algorithms play a critical role in the training of neural networks because they determine how effectively the network learns from the data and how quickly it converges to an optimal solution. These algorithms are significant during model development in improving model accuracy (optimization algorithms help improve the accuracy of neural network models by minimizing the classification error on the training data), enhancing generalization (by minimizing the loss function during training, optimization algorithms aim to generalize well to unseen data, thereby improving the model's ability to make accurate predictions on new inputs), reducing training time (efficient optimization algorithms can accelerate the convergence of the training process, leading to shorter training times for neural networks), handling complex data (since neural networks often deal with high-dimensional and non-linear data, optimization algorithms enable neural networks to effectively learn complex patterns and relationships within the data, leading to improved classification performance) and adapting to variations in data (optimization algorithms can adapt the model's parameters based on variations in the training data, ensuring robustness and stability in the face of different input distributions or data characteristics).
Stochastic Gradient Descent Optimization (SGD) works by iteratively updating the parameters of the neural network in the direction of the negative gradient of the loss function with respect to the parameters. Unlike traditional gradient descent, which computes the gradient using the entire training dataset, SGD computes the gradient using a single randomly selected sample (or a mini-batch of samples) from the dataset. This randomness introduces noise into the gradient estimates but allows SGD to make frequent updates and converge faster. The SGD process involves initializing the parameters of the neural network randomly, shuffling the training dataset and repeating the following steps until convergence - randomly selecting a sample (or a mini-batch of samples) from the dataset, computing the gradient of the loss function with respect to the parameters using the selected sample(s) and updating the parameters using the gradient and a defined learning rate. SGD demonstrates several advantages over other optimization methods in terms of efficiency (SGD is computationally efficient, especially when dealing with large datasets. It updates the parameters using only a subset of the training data in each iteration, making it suitable for training on datasets with millions or even billions of samples.), regularization (SGD introduces noise into the optimization process, which acts as a form of regularization. This helps prevent overfitting and improves the generalization ability of the neural network, especially in situations with limited training data.), and scalability (SGD scales well to deep neural network architectures with a large number of parameters. It can handle complex models with millions of parameters efficiently, making it suitable for modern deep learning applications.). However, some disadvantages of SGD include the variance in gradient estimates (SGD's reliance on single-sample or mini-batch gradient estimates introduces variance into the optimization process. This variance can lead to noisy updates and slow convergence, especially when using small mini-batch sizes.), sensitivity to learning rate (SGD's performance is sensitive to the choice of learning rate. Setting the learning rate too high may lead to unstable updates and divergence, while setting it too low may result in slow convergence and prolonged training times.), and difficulty in choosing learning rate schedule (SGD requires careful tuning of the learning rate schedule to ensure optimal convergence. Finding the right learning rate schedule can be challenging and may require extensive experimentation.)
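To make the update rule explicit before the full implementation in the succeeding cells, the core SGD step on a randomly drawn mini-batch can be summarized in a few lines. The sketch below is illustrative only; the sgd_minibatch_step name is hypothetical and grad_fn is a placeholder for a routine that runs the forward and backward passes on the sampled observations (the training loop implemented later in this section applies the same update on the full batch at every epoch).
##################################
# Illustrative sketch only:
# a single SGD step on a randomly
# drawn mini-batch, with grad_fn as
# a hypothetical placeholder for the
# forward and backward passes
##################################
import numpy as np
def sgd_minibatch_step(params, X, y, grad_fn, batch_size=32, learning_rate=0.01):
    # Randomly selecting a mini-batch of observations
    batch_index = np.random.choice(len(X), size=batch_size, replace=False)
    # Computing the gradients of the loss on the sampled mini-batch only
    gradients = grad_fn(X[batch_index], y[batch_index], params)
    # Updating each parameter in the direction opposite to its gradient
    for param_name in params:
        params[param_name] -= learning_rate * gradients['d' + param_name]
    return params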
- A neural network with the following structure was formulated:
- Hidden Layers = 3
- Number of Nodes per Hidden Layer = 5
- The backpropagation and optimization algorithms were implemented with parameter settings described as follows:
- Learning Rate = 0.01
- Epochs = 1000
- Hidden Layer Activation Function = Rectified Linear Unit (ReLU) Activation Function
- Output Layer Activation Function = Softmax Activation Function
- Loss Function Optimization Method = Stochastic Gradient Descent (SGD)
- The final loss estimate of 0.18713 at the 1000th epoch was not optimally low compared to those obtained using the other optimization methods.
- Applying parameter updates using an SGD cost function optimization, the neural network model performance is estimated as follows:
- Accuracy = 92.63804
- The estimated classification accuracy using the SGD cost function optimization was not optimal compared to those obtained using the other optimization methods.
##################################
# Defining the neural network architecture
##################################
input_dim = 2
hidden_dims = [5, 5, 5]
output_dim = 2
##################################
# Initializing model weights and biases
##################################
params = {}
np.random.seed(88888)
params['W1'] = np.random.randn(input_dim, hidden_dims[0])
params['b1'] = np.zeros(hidden_dims[0])
params['W2'] = np.random.randn(hidden_dims[0], hidden_dims[1])
params['b2'] = np.zeros(hidden_dims[1])
params['W3'] = np.random.randn(hidden_dims[1], hidden_dims[2])
params['b3'] = np.zeros(hidden_dims[2])
params['W4'] = np.random.randn(hidden_dims[2], output_dim)
params['b4'] = np.zeros(output_dim)
##################################
# Defining the activation function (ReLU)
##################################
def relu(x):
return np.maximum(0, x)
##################################
# Defining the Softmax function
##################################
def softmax(x):
exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
return exp_x / np.sum(exp_x, axis=1, keepdims=True)
##################################
# Defining the Forward propagation algorithm
##################################
def forward(X, params):
Z1 = np.dot(X, params['W1']) + params['b1']
A1 = relu(Z1)
Z2 = np.dot(A1, params['W2']) + params['b2']
A2 = relu(Z2)
Z3 = np.dot(A2, params['W3']) + params['b3']
A3 = relu(Z3)
Z4 = np.dot(A3, params['W4']) + params['b4']
A4 = softmax(Z4)
return A4, {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2, 'Z3': Z3, 'A3': A3, 'Z4': Z4, 'A4': A4}
##################################
# Defining the Cross-entropy loss
##################################
def cross_entropy_loss(y_pred, y_true):
m = y_true.shape[0]
log_likelihood = -np.log(y_pred[range(m), y_true])
loss = np.sum(log_likelihood) / m
return loss
##################################
# Defining the Backpropagation algorithm
##################################
def backward(X, y_true, params, cache):
m = y_true.shape[0]
dZ4 = cache['A4'] - np.eye(output_dim)[y_true]
dW4 = np.dot(cache['A3'].T, dZ4) / m
db4 = np.sum(dZ4, axis=0) / m
dA3 = np.dot(dZ4, params['W4'].T)
dZ3 = dA3 * (cache['Z3'] > 0)
dW3 = np.dot(cache['A2'].T, dZ3) / m
db3 = np.sum(dZ3, axis=0) / m
dA2 = np.dot(dZ3, params['W3'].T)
dZ2 = dA2 * (cache['Z2'] > 0)
dW2 = np.dot(cache['A1'].T, dZ2) / m
db2 = np.sum(dZ2, axis=0) / m
dA1 = np.dot(dZ2, params['W2'].T)
dZ1 = dA1 * (cache['Z1'] > 0)
dW1 = np.dot(X.T, dZ1) / m
db1 = np.sum(dZ1, axis=0) / m
gradients = {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2, 'dW3': dW3, 'db3': db3, 'dW4': dW4, 'db4': db4}
return gradients
##################################
# Defining the function to implement
# Stochastic Gradient Descent Optimization
##################################
def sgd(params, gradients, learning_rate):
for param_name in params:
params[param_name] -= learning_rate * gradients['d' + param_name]
##################################
# Defining the function to implement
# model training
##################################
def train(X, y, params, epochs, learning_rate, optimizer):
costs = []
accuracies = []
for epoch in range(epochs):
# Performing forward pass
y_pred, cache = forward(X, params)
# Computing loss
loss = cross_entropy_loss(y_pred, y)
costs.append(loss)
# Computing accuracy
accuracy = np.mean(np.argmax(y_pred, axis=1) == y)
accuracies.append(accuracy)
# Performing backpropagation
gradients = backward(X, y, params, cache)
# Updating the parameters using the specified optimizer
if optimizer == 'SGD':
sgd(params, gradients, learning_rate)
elif optimizer == 'ADAM':
adam(params, gradients, learning_rate)
elif optimizer == 'ADAGRAD':
adagrad(params, gradients, learning_rate)
elif optimizer == 'ADADELTA':
adadelta(params, gradients)
elif optimizer == 'LION':
lion(params, gradients, learning_rate)
elif optimizer == 'RMSPROP':
rmsprop(params, gradients, learning_rate)
# Printing model iteration progress
if epoch % 100 == 0:
print(f'Epoch {epoch}: Loss={loss}, Accuracy={accuracy}')
return costs, accuracies
##################################
# Defining model training parameters
##################################
epochs = 1001
learning_rate = 0.01
##################################
# Implementing the method on
# Stochastic Gradient Descent Optimization
##################################
optimizers = ['SGD']
all_costs = {}
all_accuracies = {}
for optimizer in optimizers:
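    # Note: dict.copy() is a shallow copy, so the underlying weight arrays are shared with params;
    # in-place updates therefore carry over between optimizer runs, and a later run continues
    # from the weights left by the previous one (copy.deepcopy(params) would be required
    # for an independent re-initialization per optimizer).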
params_copy = params.copy()
costs, accuracies = train(matrix_x_values, y_values, params_copy, epochs, learning_rate, optimizer)
all_costs[optimizer] = costs
all_accuracies[optimizer] = accuracies
Epoch 0: Loss=0.977656026481093, Accuracy=0.5214723926380368
Epoch 100: Loss=0.3327003767656589, Accuracy=0.8895705521472392
Epoch 200: Loss=0.2626323498036725, Accuracy=0.901840490797546
Epoch 300: Loss=0.23435760093994626, Accuracy=0.901840490797546
Epoch 400: Loss=0.21806821157745296, Accuracy=0.9079754601226994
Epoch 500: Loss=0.20740502854560894, Accuracy=0.9263803680981595
Epoch 600: Loss=0.20011464071141585, Accuracy=0.9263803680981595
Epoch 700: Loss=0.1949784687584053, Accuracy=0.9263803680981595
Epoch 800: Loss=0.1914926476761292, Accuracy=0.9263803680981595
Epoch 900: Loss=0.188994671488959, Accuracy=0.9263803680981595
Epoch 1000: Loss=0.1871343830968314, Accuracy=0.9263803680981595
##################################
# Plotting the cost against iterations for
# Stochastic Gradient Descent Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_costs[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('SGD Optimization: Cost Function by Iteration')
plt.ylim(0.15, 0.30)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Plotting the classification accuracy against iterations for
# Stochastic Gradient Descent Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_accuracies[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Accuracy')
plt.title('SGD Optimization: Classification Accuracy by Iteration')
plt.ylim(0.00, 1.00)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Gathering the final accuracy and cost values for
# Stochastic Gradient Descent Optimization
##################################
SGD_metrics = pd.DataFrame(["ACCURACY","LOSS"])
SGD_values = pd.DataFrame([accuracies[-1],costs[-1]])
SGD_method = pd.DataFrame(["SGD"]*2)
SGD_summary = pd.concat([SGD_metrics,
SGD_values,
SGD_method], axis=1)
SGD_summary.columns = ['Metric', 'Value', 'Method']
SGD_summary.reset_index(inplace=True, drop=True)
display(SGD_summary)
Metric | Value | Method | |
---|---|---|---|
0 | ACCURACY | 0.926380 | SGD |
1 | LOSS | 0.187134 | SGD |
1.6.3 Adaptive Moment Estimation Optimization ¶
Backpropagation and Weight Update, in the context of an artificial neural network, involve the process of iteratively adjusting the weights of the connections between neurons in the network to minimize the difference between the predicted and the actual target responses. Input data is fed into the neural network, and it propagates through the network layer by layer, starting from the input layer, through hidden layers, and ending at the output layer. At each neuron, the weighted sum of inputs is calculated, followed by the application of an activation function to produce the neuron's output. Once the forward pass is complete, the network's output is compared to the actual target output. The difference between the predicted output and the actual output is quantified using a loss function, which measures the discrepancy between the predicted and actual values. Common loss functions for classification tasks include cross-entropy loss. During the backward pass, the error is propagated backward through the network to compute the gradients of the loss function with respect to each weight in the network. This is achieved using the chain rule of calculus, which allows the error to be decomposed and distributed backward through the network. The gradients quantify how much a change in each weight would affect the overall error of the network. Once the gradients are computed, the weights are updated in the opposite direction of the gradient to minimize the error. This update is typically performed using an optimization algorithm such as gradient descent, which adjusts the weights in proportion to their gradients and a learning rate hyperparameter. The learning rate determines the size of the step taken in the direction opposite to the gradient. These steps are repeated for multiple iterations (epochs) over the training data. As the training progresses, the weights are adjusted iteratively to minimize the error, leading to a neural network model that accurately classifies input data.
Optimization Algorithms, in the context of neural network classification, are methods used to adjust the parameters (weights and biases) of a neural network during the training process in order to minimize a predefined loss function. The primary goal of these algorithms is to optimize the performance of the neural network by iteratively updating its parameters based on the feedback provided by the training data. Optimization algorithms play a critical role in the training of neural networks because they determine how effectively the network learns from the data and how quickly it converges to an optimal solution. These algorithms are significant during model development in improving model accuracy (optimization algorithms help improve the accuracy of neural network models by minimizing the classification error on the training data), enhancing generalization (by minimizing the loss function during training, optimization algorithms aim to generalize well to unseen data, thereby improving the model's ability to make accurate predictions on new inputs), reducing training time (efficient optimization algorithms can accelerate the convergence of the training process, leading to shorter training times for neural networks), handling complex data (since neural networks often deal with high-dimensional and non-linear data, optimization algorithms enable neural networks to effectively learn complex patterns and relationships within the data, leading to improved classification performance) and adapting to variations in data (optimization algorithms can adapt the model's parameters based on variations in the training data, ensuring robustness and stability in the face of different input distributions or data characteristics).
Adaptive Moment Estimation Optimization (Adam) combines both momentum-based methods and adaptive learning rate methods by maintaining exponentially decaying moving averages of past gradients and their squares, which are then used to adaptively adjust the learning rates for each parameter. The Adam process involves initializing the parameters, including the first and second moment estimates (m and v) to zero. In each iteration of training which is repeated until convergence or a predetermined number of iterations, the gradients of the loss function with respect to the parameters are computed, the biased first and second moment estimates are sequentially determined, the bias in the first and second moment estimates are corrected and the model parameters are subsequently updated. Adam demonstrates several advantages over other optimization methods in terms of adaptive learning rates (Adam adapts the learning rates for each parameter individually, making it less sensitive to manual tuning of learning rate hyperparameters compared to SGD and RMSprop), efficient convergence (Adam often converges faster than SGD and RMSprop, especially in the presence of sparse gradients or non-stationary objectives.), and robustness to noisy gradients (Adam's adaptive learning rate mechanism and momentum-like updates make it more robust to noisy gradients compared to SGD and AdaGrad.). However, some disadvantages of Adam include memory and computational cost (Adam requires additional memory and computation to maintain the moving average estimates of the gradients and their squares. This can increase the computational overhead, especially for large-scale neural networks.), sensitivity to hyperparameters (although Adam is less sensitive to learning rate hyperparameters compared to SGD, it still requires tuning of other hyperparameters such as the momentum parameters, and potential overfitting (In some cases, Adam may exhibit aggressive updates, leading to overfitting, especially when the momentum parameters are not properly tuned.).
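To make the bias-corrected update concrete before the implementation in the succeeding cells, the Adam step for a single parameter array can be summarized in a few lines, with the moment states passed in and returned explicitly. The sketch below is illustrative only; the adam_step name is hypothetical, and the default beta1, beta2, and epsilon values mirror the settings listed in this section.
##################################
# Illustrative sketch only:
# one Adam update for a single
# parameter array with the moment
# states passed in and returned
##################################
import numpy as np
def adam_step(theta, grad, m, v, t, learning_rate=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Updating the exponentially decaying first and second moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * (grad ** 2)
    # Correcting the initialization bias of both moment estimates (t is the 1-based step count)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Applying the adaptive parameter update
    theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v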
- A neural network with the following structure was formulated:
- Hidden Layers = 3
- Number of Nodes per Hidden Layer = 5
- The backpropagation and optimization algorithms were implemented with parameter settings described as follows:
- Learning Rate = 0.01
- Epochs = 1000
- Hidden Layer Activation Function = Rectified Linear Unit (ReLU) Activation Function
- Output Layer Activation Function = Softmax Activation Function
- Loss Function Optimization Method = Adaptive Moment Estimation (ADAM)
- Beta1 (Exponential Decay Rate for First Moment) = 0.900
- Beta2 (Exponential Decay Rate for Second Moment) = 0.999
- Epsilon (Constant to Maintain Numerical Stability During Update) = 1e-8
- The final loss estimate of 0.17663 at the 1000th epoch was not optimally low compared to those obtained using the other optimization methods.
- Applying parameter updates using an ADAM cost function optimization, the neural network model performance is estimated as follows:
- Accuracy = 91.41104
- The estimated classification accuracy using the ADAM cost function optimization was not optimal compared to those obtained using the other optimization methods.
##################################
# Defining the function to implement
# Adaptive Moment Estimation Optimization
##################################
def adam(params, gradients, learning_rate, m=None, v=None, beta1=0.9, beta2=0.999, eps=1e-8, t=0):
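    # Note: with these default arguments, m, v, and t start from fresh zero states on every call,
    # and the updated states are not returned; the training loop above calls
    # adam(params, gradients, learning_rate) only, so the moment estimates are not carried
    # across epochs (persisting and re-passing m, v, and t would give the full Adam behavior).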
if m is None:
m = {k: np.zeros_like(v) for k, v in params.items()}
if v is None:
v = {k: np.zeros_like(v) for k, v in params.items()}
t += 1
for param_name in params:
m[param_name] = beta1 * m[param_name] + (1 - beta1) * gradients['d' + param_name]
v[param_name] = beta2 * v[param_name] + (1 - beta2) * (gradients['d' + param_name] ** 2)
m_hat = m[param_name] / (1 - beta1 ** t)
v_hat = v[param_name] / (1 - beta2 ** t)
params[param_name] -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
##################################
# Defining model training parameters
##################################
epochs = 1001
learning_rate = 0.01
##################################
# Implementing the method on
# Adaptive Moment Estimation Optimization
##################################
optimizers = ['ADAM']
all_costs = {}
all_accuracies = {}
for optimizer in optimizers:
params_copy = params.copy()
costs, accuracies = train(matrix_x_values, y_values, params_copy, epochs, learning_rate, optimizer)
all_costs[optimizer] = costs
all_accuracies[optimizer] = accuracies
Epoch 0: Loss=0.18711788665783202, Accuracy=0.9263803680981595
Epoch 100: Loss=0.18001254333263375, Accuracy=0.9141104294478528
Epoch 200: Loss=0.17884340079313119, Accuracy=0.9141104294478528
Epoch 300: Loss=0.17809955980988293, Accuracy=0.9141104294478528
Epoch 400: Loss=0.17764474833280816, Accuracy=0.9141104294478528
Epoch 500: Loss=0.1773116214774388, Accuracy=0.9141104294478528
Epoch 600: Loss=0.177062185019735, Accuracy=0.9141104294478528
Epoch 700: Loss=0.17687868107722884, Accuracy=0.9141104294478528
Epoch 800: Loss=0.17674860655817845, Accuracy=0.9141104294478528
Epoch 900: Loss=0.17669745796295783, Accuracy=0.9141104294478528
Epoch 1000: Loss=0.17663295512555574, Accuracy=0.9141104294478528
##################################
# Plotting the cost against iterations for
# Adaptive Moment Estimation Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_costs[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('ADAM Optimization: Cost Function by Iteration')
plt.ylim(0.15, 0.30)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Plotting the classification accuracy against iterations for
# Adaptive Moment Estimation Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_accuracies[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Accuracy')
plt.title('ADAM Optimization: Classification Accuracy by Iteration')
plt.ylim(0.00, 1.00)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Gathering the final accuracy and cost values for
# Adaptive Moment Estimation Optimization
##################################
ADAM_metrics = pd.DataFrame(["ACCURACY","LOSS"])
ADAM_values = pd.DataFrame([accuracies[-1],costs[-1]])
ADAM_method = pd.DataFrame(["ADAM"]*2)
ADAM_summary = pd.concat([ADAM_metrics,
ADAM_values,
ADAM_method], axis=1)
ADAM_summary.columns = ['Metric', 'Value', 'Method']
ADAM_summary.reset_index(inplace=True, drop=True)
display(ADAM_summary)
Metric | Value | Method | |
---|---|---|---|
0 | ACCURACY | 0.914110 | ADAM |
1 | LOSS | 0.176633 | ADAM |
1.6.4 Adaptive Gradient Algorithm Optimization ¶
Backpropagation and Weight Update, in the context of an artificial neural network, involve the process of iteratively adjusting the weights of the connections between neurons in the network to minimize the difference between the predicted and the actual target responses. Input data is fed into the neural network, and it propagates through the network layer by layer, starting from the input layer, through hidden layers, and ending at the output layer. At each neuron, the weighted sum of inputs is calculated, followed by the application of an activation function to produce the neuron's output. Once the forward pass is complete, the network's output is compared to the actual target output. The difference between the predicted output and the actual output is quantified using a loss function, which measures the discrepancy between the predicted and actual values. Common loss functions for classification tasks include cross-entropy loss. During the backward pass, the error is propagated backward through the network to compute the gradients of the loss function with respect to each weight in the network. This is achieved using the chain rule of calculus, which allows the error to be decomposed and distributed backward through the network. The gradients quantify how much a change in each weight would affect the overall error of the network. Once the gradients are computed, the weights are updated in the opposite direction of the gradient to minimize the error. This update is typically performed using an optimization algorithm such as gradient descent, which adjusts the weights in proportion to their gradients and a learning rate hyperparameter. The learning rate determines the size of the step taken in the direction opposite to the gradient. These steps are repeated for multiple iterations (epochs) over the training data. As the training progresses, the weights are adjusted iteratively to minimize the error, leading to a neural network model that accurately classifies input data.
Optimization Algorithms, in the context of neural network classification, are methods used to adjust the parameters (weights and biases) of a neural network during the training process in order to minimize a predefined loss function. The primary goal of these algorithms is to optimize the performance of the neural network by iteratively updating its parameters based on the feedback provided by the training data. Optimization algorithms play a critical role in the training of neural networks because they determine how effectively the network learns from the data and how quickly it converges to an optimal solution. These algorithms are significant during model development in improving model accuracy (optimization algorithms help improve the accuracy of neural network models by minimizing the classification error on the training data), enhancing generalization (by minimizing the loss function during training, optimization algorithms aim to generalize well to unseen data, thereby improving the model's ability to make accurate predictions on new inputs), reducing training time (efficient optimization algorithms can accelerate the convergence of the training process, leading to shorter training times for neural networks), handling complex data (since neural networks often deal with high-dimensional and non-linear data, optimization algorithms enable neural networks to effectively learn complex patterns and relationships within the data, leading to improved classification performance) and adapting to variations in data (optimization algorithms can adapt the model's parameters based on variations in the training data, ensuring robustness and stability in the face of different input distributions or data characteristics).
Adaptive Gradient Algorithm Optimization (AdaGrad) adapts the learning rates of individual parameters based on the historical gradient information. The main idea behind AdaGrad is to decrease the learning rate for parameters that have been updated frequently and increase the learning rate for parameters that have been updated infrequently. The AdaGrad process involves initializing the parameters, including the squared gradient accumulation variable (denoted as "cache"), to a small positive value. In each iteration of training which is repeated until convergence or a predetermined number of iterations, the gradients of the loss function with respect to the parameters are computed, the squared gradient accumulation variable is determined, and the parameters are subsequently updated using the accumulated squared gradients. AdaGrad demonstrates several advantages over other optimization methods in terms of adaptive learning rates (AdaGrad adapts the learning rates for each parameter individually, making it less sensitive to manual tuning of learning rate hyperparameters compared to SGD.), efficient handling of sparse data (AdaGrad performs well in scenarios where the data is sparse or features have varying importance. It adjusts the learning rates based on the accumulated gradients, which helps handle such data efficiently.), and quick convergence (AdaGrad often converges quickly, especially in settings where the learning rates need to be adjusted dynamically based on the gradients' characteristics. This efficiency can lead to faster convergence compared to SGD.). However, some disadvantages of AdaGrad include diminishing learning rates (AdaGrad's accumulation of squared gradients can lead to diminishing learning rates over time. As the accumulation increases, the learning rates for parameters may become very small, which can slow down the learning process, especially in later stages of training.), memory and computational cost (AdaGrad requires additional memory and computation to store and update the accumulated squared gradients for each parameter. This overhead can become significant for large-scale neural networks with a high number of parameters.), and potential oversensitivity to initial learning rate (AdaGrad's performance can be sensitive to the initial learning rate setting. If the initial learning rate is too high, AdaGrad may converge prematurely, while a too low initial learning rate can lead to slow convergence.).
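For reference, the textbook form of the AdaGrad update, which the adagrad() implementation further below follows (with the accumulator stored in its cache argument), can be written as shown; the symbols $\theta_t$ (parameter), $g_t$ (gradient at iteration $t$), $r_t$ (accumulated squared gradients), $\eta$ (learning rate), and $\epsilon$ (stability constant) are introduced here only for this illustration.
$$ r_t = r_{t-1} + g_t^{2}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon}\, g_t $$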
- A neural network with the following structure was formulated:
- Hidden Layer = 3
- Number of Nodes per Hidden Layer = 5
- The backpropagation and optimization algorithms were implemented with parameter settings described as follows:
- Learning Rate = 0.01
- Epochs = 1000
- Hidden Layer Activation Function = Sigmoid Activation Function
- Output Layer Activation Function = Softmax Activation Function
- Loss Function Optimization Method = Adaptive Gradient Algorithm (ADAGRAD)
- Epsilon (Constant to Maintain Numerical Stability During Update) = 1e-8
- The final loss estimate of 0.17211 at the 1000th epoch was not the lowest obtained across the optimization methods evaluated.
- Applying parameter updates using an ADAGRAD cost function optimization, the neural network model performance is estimated as follows:
- Accuracy = 92.63803
- The estimated classification accuracy obtained using the ADAGRAD cost function optimization was not the highest across the optimization methods evaluated.
##################################
# Defining the function to implement
# Adaptive Gradient Algorithm Optimization
##################################
def adagrad(params, gradients, learning_rate, cache=None, eps=1e-8):
    # Initialize the squared-gradient accumulator when the caller does not supply one
    # (the accumulator persists across iterations only if the caller passes it back in)
    if cache is None:
        cache = {key: np.zeros_like(value) for key, value in params.items()}
    for key in params.keys():
        # Accumulate the squared gradients for each parameter
        cache[key] += gradients['d' + key] ** 2
        # Scale the step for each parameter by the root of its accumulated squared gradients
        params[key] -= learning_rate * gradients['d' + key] / (np.sqrt(cache[key]) + eps)
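To illustrate how the accumulated squared gradients are meant to carry over from one update to the next, a minimal, hypothetical usage sketch is shown below. The toy_params, toy_gradients, and toy_cache names are invented for this illustration only and do not appear elsewhere in the notebook, and the gradients would normally come from the backpropagation step rather than being fixed constants.
##################################
# Minimal usage sketch (illustrative only):
# persisting the AdaGrad accumulator across
# iterations by passing the same cache dictionary
# into every call
##################################
import numpy as np  # already imported earlier in the notebook
# Hypothetical two-parameter model; the 'W1'/'b1' keys follow the
# 'd' + key naming convention expected by the adagrad() function above
toy_params = {'W1': np.array([0.5, -0.3]), 'b1': np.array([0.1])}
toy_cache = {key: np.zeros_like(value) for key, value in toy_params.items()}
for step in range(3):
    # Hypothetical fixed gradients; in the notebook these are produced by backpropagation
    toy_gradients = {'dW1': np.array([0.2, -0.1]), 'db1': np.array([0.05])}
    adagrad(toy_params, toy_gradients, learning_rate=0.01, cache=toy_cache)
    print(f"Step {step}: W1={toy_params['W1']}, accumulated squared gradients={toy_cache['W1']}")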
##################################
# Defining model training parameters
##################################
epochs = 1001
learning_rate = 0.01
##################################
# Implementing the method on
# Adaptive Gradient Algorithm Optimization
##################################
optimizers = ['ADAGRAD']
all_costs = {}
all_accuracies = {}
for optimizer in optimizers:
params_copy = params.copy()
costs, accuracies = train(matrix_x_values, y_values, params_copy, epochs, learning_rate, optimizer)
all_costs[optimizer] = costs
all_accuracies[optimizer] = accuracies
Epoch 0: Loss=0.17255033561236316, Accuracy=0.9263803680981595
Epoch 100: Loss=0.17225880365181573, Accuracy=0.9263803680981595
Epoch 200: Loss=0.17214805276539966, Accuracy=0.9263803680981595
Epoch 300: Loss=0.17214368549141762, Accuracy=0.9263803680981595
Epoch 400: Loss=0.17213930321391002, Accuracy=0.9263803680981595
Epoch 500: Loss=0.17213491306759762, Accuracy=0.9263803680981595
Epoch 600: Loss=0.1721305166509117, Accuracy=0.9263803680981595
Epoch 700: Loss=0.17212611430185765, Accuracy=0.9263803680981595
Epoch 800: Loss=0.17212170607483818, Accuracy=0.9263803680981595
Epoch 900: Loss=0.17211729196072975, Accuracy=0.9263803680981595
Epoch 1000: Loss=0.17211287193611646, Accuracy=0.9263803680981595
##################################
# Plotting the cost against iterations for
# Adaptive Gradient Algorithm Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_costs[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('ADAGRAD Optimization: Cost Function by Iteration')
plt.ylim(0.15, 0.30)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Plotting the classification accuracy against iterations for
# Adaptive Gradient Algorithm Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_accuracies[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Accuracy')
plt.title('ADAGRAD Optimization: Classification Accuracy by Iteration')
plt.ylim(0.00, 1.00)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Gathering the final accuracy and cost values for
# Adaptive Gradient Algorithm Optimization
##################################
ADAGRAD_metrics = pd.DataFrame(["ACCURACY","LOSS"])
ADAGRAD_values = pd.DataFrame([accuracies[-1],costs[-1]])
ADAGRAD_method = pd.DataFrame(["ADAGRAD"]*2)
ADAGRAD_summary = pd.concat([ADAGRAD_metrics,
ADAGRAD_values,
ADAGRAD_method], axis=1)
ADAGRAD_summary.columns = ['Metric', 'Value', 'Method']
ADAGRAD_summary.reset_index(inplace=True, drop=True)
display(ADAGRAD_summary)
|   | Metric | Value | Method |
|---|---|---|---|
| 0 | ACCURACY | 0.926380 | ADAGRAD |
| 1 | LOSS | 0.172113 | ADAGRAD |
1.6.5 AdaDelta Optimization ¶
AdaDelta Optimization (AdaDelta) is an extension of AdaGrad and addresses its limitation of diminishing learning rates over time. AdaDelta dynamically adapts the learning rates based on a moving average of past gradients and updates without the need for an explicit learning rate parameter. The AdaDelta process involves initializing the parameters, including the moving average variables for the gradient and the parameter update to zero, and setting a decay rate parameter. In each iteration of training, which is repeated until convergence or a predetermined number of iterations, the gradients of the loss function with respect to the parameters are computed, the moving average variables are estimated, the parameter updates are determined from a moving average of past updates, and the model parameters together with their associated moving averages are updated. AdaDelta demonstrates several advantages over other optimization methods in terms of no manual learning rate tuning (AdaDelta eliminates the need for manually tuning learning rate hyperparameters, making it more user-friendly and robust to variations in data and architectures compared to methods like SGD and Adam.), efficient handling of sparse gradients (AdaDelta performs well in scenarios where the gradients are sparse or have varying magnitudes. Its adaptive learning rate mechanism allows it to handle such gradients efficiently, leading to improved optimization performance.), and no diminishing learning rates (AdaDelta addresses the issue of diminishing learning rates over time, which can occur in AdaGrad. By incorporating a moving average of past updates, AdaDelta ensures that the learning rates remain relevant throughout the training process.). However, some disadvantages of AdaDelta include memory and computational cost (AdaDelta requires additional memory and computation to store and update the moving averages of gradients and updates. This overhead can become significant for large-scale neural networks with a high number of parameters.), sensitivity to decay rate parameter (the performance of AdaDelta can be sensitive to the choice of the decay rate parameter. Setting this parameter too low may result in slow convergence, while setting it too high may lead to instability or oscillations in the optimization process.), and potential overshooting (AdaDelta's reliance on a moving average of past updates may lead to overshooting or oscillations in the optimization process, especially in scenarios with highly non-convex objectives or noisy gradients.).
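For reference, the standard AdaDelta recursions, which the adadelta() implementation further below is patterned after, can be written as shown; the symbols $E[g^{2}]_t$ and $E[\Delta\theta^{2}]_t$ (decaying averages of squared gradients and squared updates), $\rho$ (decay rate), and $\epsilon$ (stability constant) are introduced here only for this illustration.
$$ E[g^{2}]_t = \rho\, E[g^{2}]_{t-1} + (1-\rho)\, g_t^{2} $$
$$ \Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^{2}]_{t-1} + \epsilon}}{\sqrt{E[g^{2}]_t + \epsilon}}\, g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t $$
$$ E[\Delta\theta^{2}]_t = \rho\, E[\Delta\theta^{2}]_{t-1} + (1-\rho)\, \Delta\theta_t^{2} $$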
- A neural network with the following structure was formulated:
- Hidden Layer = 3
- Number of Nodes per Hidden Layer = 5
- The backpropagation and optimization algorithms were implemented with parameter settings described as follows:
- Learning Rate = 0.01
- Epochs = 1000
- Hidden Layer Activation Function = Sigmoid Activation Function
- Output Layer Activation Function = Softmax Activation Function
- Loss Function Optimization Method = AdaDelta Optimization (ADADELTA)
- Rho (Exponential Decay of Accumulated Past Gradients) = 0.900
- Epsilon (Constant to Maintain Numerical Stability During Update) = 1e-8
- The final loss estimate of 0.15831 at the 1000th epoch was not the lowest obtained across the optimization methods evaluated.
- Applying parameter updates using an ADADELTA cost function optimization, the neural network model performance is estimated as follows:
- Accuracy = 92.63803
- The estimated classification accuracy obtained using the ADADELTA cost function optimization was not the highest across the optimization methods evaluated.
##################################
# Defining the function to implement
# AdaDelta Optimization
##################################
def adadelta(params, gradients, cache=None, delta=None, rho=0.9, eps=1e-8):
    # Decaying average of squared gradients
    if cache is None:
        cache = {key: np.zeros_like(value) for key, value in params.items()}
    # Decaying average of squared parameter updates
    # (both accumulators persist across iterations only if the caller passes them back in)
    if delta is None:
        delta = {key: np.zeros_like(value) for key, value in params.items()}
    for key in params.keys():
        # Update the decaying average of squared gradients
        cache[key] = rho * cache[key] + (1 - rho) * (gradients['d' + key] ** 2)
        # Compute the parameter update from the ratio of the two running averages
        update = -np.sqrt(delta[key] + eps) * gradients['d' + key] / np.sqrt(cache[key] + eps)
        params[key] += update
        # Update the decaying average of squared parameter updates
        delta[key] = rho * delta[key] + (1 - rho) * (update ** 2)
##################################
# Defining model training parameters
##################################
epochs = 1001
learning_rate = 0.01
##################################
# Implementing the method on
# AdaDelta Optimization
##################################
optimizers = ['ADADELTA']
all_costs = {}
all_accuracies = {}
for optimizer in optimizers:
params_copy = params.copy()
costs, accuracies = train(matrix_x_values, y_values, params_copy, epochs, learning_rate, optimizer)
all_costs[optimizer] = costs
all_accuracies[optimizer] = accuracies
Epoch 0: Loss=0.1765339451338176, Accuracy=0.9141104294478528
Epoch 100: Loss=0.16879310577636317, Accuracy=0.9263803680981595
Epoch 200: Loss=0.16724377618147845, Accuracy=0.9263803680981595
Epoch 300: Loss=0.16595131396735266, Accuracy=0.9263803680981595
Epoch 400: Loss=0.16465897862561618, Accuracy=0.9263803680981595
Epoch 500: Loss=0.16365317756183773, Accuracy=0.9263803680981595
Epoch 600: Loss=0.1627767149760552, Accuracy=0.9263803680981595
Epoch 700: Loss=0.1615633233464469, Accuracy=0.9263803680981595
Epoch 800: Loss=0.16044848109322676, Accuracy=0.9263803680981595
Epoch 900: Loss=0.15932663270830294, Accuracy=0.9263803680981595
Epoch 1000: Loss=0.15831221856929253, Accuracy=0.9263803680981595
##################################
# Plotting the cost against iterations for
# AdaDelta Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_costs[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('ADADELTA Optimization: Cost Function by Iteration')
plt.ylim(0.15, 0.30)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Plotting the classification accuracy against iterations for
# AdaDelta Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_accuracies[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Accuracy')
plt.title('ADADELTA Optimization: Classification Accuracy by Iteration')
plt.ylim(0.00, 1.00)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Gathering the final accuracy and cost values for
# AdaDelta Optimization
##################################
ADADELTA_metrics = pd.DataFrame(["ACCURACY","LOSS"])
ADADELTA_values = pd.DataFrame([accuracies[-1],costs[-1]])
ADADELTA_method = pd.DataFrame(["ADADELTA"]*2)
ADADELTA_summary = pd.concat([ADADELTA_metrics,
ADADELTA_values,
ADADELTA_method], axis=1)
ADADELTA_summary.columns = ['Metric', 'Value', 'Method']
ADADELTA_summary.reset_index(inplace=True, drop=True)
display(ADADELTA_summary)
|   | Metric | Value | Method |
|---|---|---|---|
| 0 | ACCURACY | 0.926380 | ADADELTA |
| 1 | LOSS | 0.158312 | ADADELTA |
1.6.6 Layer-wise Optimized Non-convex Optimization ¶
Layer-wise Optimized Non-convex Optimization (Lion) focuses on adapting learning rates for each layer of the neural network based on the curvature of the loss landscape. Lion aims to accelerate convergence, improve optimization efficiency, and enhance the overall performance of deep neural networks in classification tasks. The Lion process involves adapting the learning rates for each layer of the neural network independently, allowing it to handle variations in the curvature and scale of the loss landscape across different layers; incorporating momentum-like updates to help accelerate convergence and navigate through the optimization space more efficiently; and dynamically adjusting the learning rates based on the curvature of the loss landscape, ensuring that larger updates are made in regions with steep gradients and smaller updates in regions with shallow gradients. Lion demonstrates several advantages over other optimization methods in terms of layer-wise adaptation (Lion's ability to adapt learning rates layer-wise allows it to exploit the local curvature of the loss landscape, leading to more efficient optimization and faster convergence compared to methods with uniform learning rates.), efficient handling of deep architectures (Lion is specifically designed for training deep neural networks and can handle the challenges associated with deep architectures, such as vanishing gradients and optimization instabilities, more effectively than traditional optimization methods.), and enhanced generalization (Lion's adaptive learning rates and momentum-like updates help prevent overfitting and improve the generalization ability of the neural network classifier, leading to better performance on unseen data.). However, some disadvantages of Lion include complexity (Lion may have higher computational and implementation complexity compared to simpler optimization methods like SGD or AdaGrad. It requires careful tuning of hyperparameters and may be more challenging to implement correctly.), sensitivity to hyperparameters (like many optimization algorithms, Lion's performance can be sensitive to the choice of hyperparameters, including the momentum parameter and the learning rate schedule. Finding the optimal hyperparameters may require extensive experimentation and tuning.), and limited practical evaluation (Lion is a relatively new optimization algorithm, and its practical performance may not be extensively evaluated or well-understood compared to more established methods like SGD, Adam, or RMSprop.).
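To make the update rule concrete, the recursions implemented in the lion() function defined further below can be summarized as shown; the symbols $\theta_t$ (parameter), $g_t$ (gradient), $z_t$ and $r_t$ (decaying averages of the gradients and squared gradients), $\gamma$ (decay rate), $\eta$ (learning rate), and $\epsilon$ (stability constant) are introduced here only for this summary.
$$ z_t = \gamma\, z_{t-1} + (1-\gamma)\, g_t, \qquad r_t = \gamma\, r_{t-1} + (1-\gamma)\, g_t^{2} $$
$$ \theta_{t+1} = \theta_t - \frac{\eta\, z_t}{\sqrt{r_t + \epsilon}} $$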
- A neural network with the following structure was formulated:
- Hidden Layer = 3
- Number of Nodes per Hidden Layer = 5
- The backpropagation and optimization algorithms were implemented with parameter settings described as follows:
- Learning Rate = 0.01
- Epochs = 1000
- Hidden Layer Activation Function = Sigmoid Activation Function
- Output Layer Activation Function = Softmax Activation Function
- Loss Function Optimization Method = Layer-wise Optimized Non-convex Optimization (LION)
- Gamma (Exponential Decay of Accumulated Past Gradients) = 0.999
- Epsilon (Constant to Maintain Numerical Stability During Update) = 1e-8
- The final loss estimate of 0.15060 at the 1000th epoch was the lowest obtained across the optimization methods evaluated.
- Applying parameter updates using a LION cost function optimization, the neural network model performance is estimated as follows:
- Accuracy = 93.25153
- The estimated classification accuracy obtained using the LION cost function optimization was the highest across the optimization methods evaluated.
##################################
# Defining the function to implement
# Layer-wise Optimized Non-convex Optimization
##################################
def lion(params, gradients, learning_rate, z=None, r=None, gamma=0.999, eps=1e-8):
    # Exponentially decaying average of the gradients (momentum-like term)
    if z is None:
        z = {key: np.zeros_like(value) for key, value in params.items()}
    # Exponentially decaying average of the squared gradients (scale term)
    # (both accumulators persist across iterations only if the caller passes them back in)
    if r is None:
        r = {key: np.zeros_like(value) for key, value in params.items()}
    for key in params.keys():
        # Update the running averages of the gradients and squared gradients
        z[key] = gamma * z[key] + (1 - gamma) * gradients['d' + key]
        r[key] = gamma * r[key] + (1 - gamma) * (gradients['d' + key] ** 2)
        # Scale the momentum-like term by the root mean square of the gradients
        delta = - learning_rate * z[key] / np.sqrt(r[key] + eps)
        params[key] += delta
##################################
# Defining model training parameters
##################################
epochs = 1001
learning_rate = 0.01
##################################
# Implementing the method on
# Layer-wise Optimized Non-convex Optimization
##################################
optimizers = ['LION']
all_costs = {}
all_accuracies = {}
for optimizer in optimizers:
params_copy = params.copy()
costs, accuracies = train(matrix_x_values, y_values, params_copy, epochs, learning_rate, optimizer)
all_costs[optimizer] = costs
all_accuracies[optimizer] = accuracies
Epoch 0: Loss=0.15830151628894845, Accuracy=0.9263803680981595
Epoch 100: Loss=0.15764517713374532, Accuracy=0.9263803680981595
Epoch 200: Loss=0.1569574448963875, Accuracy=0.9263803680981595
Epoch 300: Loss=0.1562716676289252, Accuracy=0.9263803680981595
Epoch 400: Loss=0.15561303244310098, Accuracy=0.9263803680981595
Epoch 500: Loss=0.1547200958620371, Accuracy=0.9263803680981595
Epoch 600: Loss=0.15388359677979635, Accuracy=0.9263803680981595
Epoch 700: Loss=0.15307060515557244, Accuracy=0.9263803680981595
Epoch 800: Loss=0.152279697937234, Accuracy=0.9263803680981595
Epoch 900: Loss=0.1514504787988413, Accuracy=0.9263803680981595
Epoch 1000: Loss=0.1506013171604087, Accuracy=0.9325153374233128
##################################
# Plotting the cost against iterations for
# Layer-wise Optimized Non-convex Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_costs[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('LION Optimization: Cost Function by Iteration')
plt.ylim(0.15, 0.30)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Plotting the classification accuracy against iterations for
# Layer-wise Optimized Non-convex Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_accuracies[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Accuracy')
plt.title('LION Optimization: Classification Accuracy by Iteration')
plt.ylim(0.00, 1.00)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Gathering the final accuracy and cost values for
# Layer-wise Optimized Non-convex Optimization
##################################
LION_metrics = pd.DataFrame(["ACCURACY","LOSS"])
LION_values = pd.DataFrame([accuracies[-1],costs[-1]])
LION_method = pd.DataFrame(["LION"]*2)
LION_summary = pd.concat([LION_metrics,
LION_values,
LION_method], axis=1)
LION_summary.columns = ['Metric', 'Value', 'Method']
LION_summary.reset_index(inplace=True, drop=True)
display(LION_summary)
|   | Metric | Value | Method |
|---|---|---|---|
| 0 | ACCURACY | 0.932515 | LION |
| 1 | LOSS | 0.150601 | LION |
1.6.7 Root Mean Square Propagation Optimization ¶
Root Mean Square Propagation Optimization (RMSprop) addresses the limitations of AdaGrad, specifically the issue of diminishing learning rates over time, by introducing a decaying average of past squared gradients. RMSprop adjusts the learning rates for each parameter based on the root mean square of the gradients, allowing for more efficient optimization and faster convergence. The RMSprop process involves initializing the parameters, including a decaying average variable for the squared gradients (denoted as "cache") to zero, and setting a decay rate parameter, with value typically close to 1. In each iteration of training, which is repeated until convergence or a predetermined number of iterations, the gradients of the loss function with respect to the parameters are computed, the decaying average of squared gradients is determined, and the parameters are updated using the root mean square of the gradients. RMSprop demonstrates several advantages over other optimization methods in terms of adaptive learning rates (RMSprop adapts learning rates for each parameter individually, making it less sensitive to manual tuning of learning rate hyperparameters compared to SGD.), efficient handling of noisy gradients (RMSprop performs well in scenarios with noisy gradients or non-stationary objectives. It adjusts the learning rates based on the root mean square of the gradients, effectively handling such gradients and improving optimization performance.), and prevention of diminishing learning rates (RMSprop prevents the issue of diminishing learning rates over time, which can occur in AdaGrad. This ensures that the learning rates remain relevant throughout the training process, leading to faster convergence and improved optimization efficiency.). However, some disadvantages of RMSprop include memory and computational cost (RMSprop requires additional memory and computation to store and update the decaying average of squared gradients for each parameter. This overhead can become significant for large-scale neural networks with a high number of parameters.), sensitivity to hyperparameters (the performance of RMSprop can be sensitive to the choice of hyperparameters, including the decay rate parameter. Finding the optimal hyperparameters may require extensive experimentation and tuning.), and potential overshooting (RMSprop's reliance on a decaying average of squared gradients may lead to overshooting or oscillations in the optimization process, especially in scenarios with highly non-convex objectives or noisy gradients.).
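For reference, the textbook form of the RMSprop update, which the rmsprop() implementation further below follows (with the decaying average stored in its cache argument), can be written as shown; the symbols $E[g^{2}]_t$ (decaying average of squared gradients), $\beta$ (decay rate), $\eta$ (learning rate), and $\epsilon$ (stability constant) are introduced here only for this illustration.
$$ E[g^{2}]_t = \beta\, E[g^{2}]_{t-1} + (1-\beta)\, g_t^{2}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^{2}]_t} + \epsilon}\, g_t $$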
- A neural network with the following structure was formulated:
- Hidden Layer = 3
- Number of Nodes per Hidden Layer = 5
- The backpropagation and optimization algorithms were implemented with parameter settings described as follows:
- Learning Rate = 0.01
- Epochs = 1000
- Hidden Layer Activation Function = Sigmoid Activation Function
- Output Layer Activation Function = Softmax Activation Function
- Loss Function Optimization Method = Root Mean Square Propagation (RMSPROP)
- Beta (Exponential Decay of the Average of Squared Gradients) = 0.900
- Epsilon (Constant to Maintain Numerical Stability During Update) = 1e-8
- The final loss estimate of 0.18141 at the 1000th epoch was not the lowest obtained across the optimization methods evaluated.
- Applying parameter updates using an RMSPROP cost function optimization, the neural network model performance is estimated as follows:
- Accuracy = 92.02453
- The estimated classification accuracy obtained using the RMSPROP cost function optimization was not the highest across the optimization methods evaluated.
##################################
# Defining the function to implement
# Root Mean Square Propagation Optimization
##################################
def rmsprop(params, gradients, learning_rate, cache=None, beta=0.9, eps=1e-8):
    # Decaying average of squared gradients
    # (persists across iterations only if the caller passes it back in)
    if cache is None:
        cache = {k: np.zeros_like(v) for k, v in params.items()}
    for param_name in params:
        # Update the decaying average of squared gradients for each parameter
        cache[param_name] = beta * cache[param_name] + (1 - beta) * (gradients['d' + param_name] ** 2)
        # Scale the step by the root mean square of the recent gradients
        params[param_name] -= learning_rate * gradients['d' + param_name] / (np.sqrt(cache[param_name]) + eps)
##################################
# Defining model training parameters
##################################
epochs = 1001
learning_rate = 0.01
##################################
# Implementing the method on
# Root Mean Square Propagation Optimization
##################################
optimizers = ['RMSPROP']
all_costs = {}
all_accuracies = {}
for optimizer in optimizers:
params_copy = params.copy()
costs, accuracies = train(matrix_x_values, y_values, params_copy, epochs, learning_rate, optimizer)
all_costs[optimizer] = costs
all_accuracies[optimizer] = accuracies
Epoch 0: Loss=0.1505920847548153, Accuracy=0.9325153374233128
Epoch 100: Loss=0.21546044742901022, Accuracy=0.8957055214723927
Epoch 200: Loss=0.21991542558634863, Accuracy=0.8957055214723927
Epoch 300: Loss=0.24190084144061264, Accuracy=0.8895705521472392
Epoch 400: Loss=0.19751969470152692, Accuracy=0.901840490797546
Epoch 500: Loss=0.16744501199503517, Accuracy=0.9202453987730062
Epoch 600: Loss=0.1640727702685708, Accuracy=0.9202453987730062
Epoch 700: Loss=0.17830616454468248, Accuracy=0.9202453987730062
Epoch 800: Loss=0.17992251695795375, Accuracy=0.9202453987730062
Epoch 900: Loss=0.18085626064293414, Accuracy=0.9202453987730062
Epoch 1000: Loss=0.1814133365399543, Accuracy=0.9202453987730062
##################################
# Plotting the cost against iterations for
# Root Mean Square Propagation Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_costs[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('RMSPROP Optimization: Cost Function by Iteration')
plt.ylim(0.15, 0.30)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Plotting the classification accuracy against iterations for
# Root Mean Square Propagation Optimization
##################################
plt.figure(figsize=(10, 6))
for optimizer in optimizers:
plt.plot(range(epochs), all_accuracies[optimizer], label=optimizer)
plt.xlabel('Iterations')
plt.ylabel('Accuracy')
plt.title('RMSPROP Optimization: Classification Accuracy by Iteration')
plt.ylim(0.00, 1.00)
plt.xlim(-50,1000)
plt.legend([], [], frameon=False)
plt.show()
##################################
# Gathering the final accuracy and cost values for
# Root Mean Square Propagation Optimization
##################################
RMSPROP_metrics = pd.DataFrame(["ACCURACY","LOSS"])
RMSPROP_values = pd.DataFrame([accuracies[-1],costs[-1]])
RMSPROP_method = pd.DataFrame(["RMSPROP"]*2)
RMSPROP_summary = pd.concat([RMSPROP_metrics,
RMSPROP_values,
RMSPROP_method], axis=1)
RMSPROP_summary.columns = ['Metric', 'Value', 'Method']
RMSPROP_summary.reset_index(inplace=True, drop=True)
display(RMSPROP_summary)
|   | Metric | Value | Method |
|---|---|---|---|
| 0 | ACCURACY | 0.920245 | RMSPROP |
| 1 | LOSS | 0.181413 | RMSPROP |
1.7 Consolidated Findings ¶
- While all models showed comparably high classification accuracy, the model trained with the LION optimization algorithm demonstrated the lowest estimated cost values, leading to the best discrimination between the two classes of the dichotomous response.
- LION = Layer-wise Optimized Non-convex Optimization
- The choice of Optimization Algorithm can have a significant impact on the performance and training dynamics of a neural network classification model in terms of generalization ability, convergence speed, noise robustness, learning rate sensitivity, computational efficiency and training stability. The most appropriate algorithm should be carefully considered based on the specific characteristics of the dataset, model architecture, computational resources, and desired training objectives. Experimentation and empirical validation are often necessary to determine the most suitable optimization algorithm for a given neural network classification task.
##################################
# Consolidating all the
# model performance metrics
##################################
model_performance_comparison = pd.concat([SGD_summary,
ADAM_summary,
ADAGRAD_summary,
ADADELTA_summary,
LION_summary,
RMSPROP_summary],
ignore_index=True)
print('Neural Network Model Comparison: ')
display(model_performance_comparison)
Neural Network Model Comparison:
|    | Metric | Value | Method |
|---|---|---|---|
| 0 | ACCURACY | 0.926380 | SGD |
| 1 | LOSS | 0.187134 | SGD |
| 2 | ACCURACY | 0.914110 | ADAM |
| 3 | LOSS | 0.176633 | ADAM |
| 4 | ACCURACY | 0.926380 | ADAGRAD |
| 5 | LOSS | 0.172113 | ADAGRAD |
| 6 | ACCURACY | 0.926380 | ADADELTA |
| 7 | LOSS | 0.158312 | ADADELTA |
| 8 | ACCURACY | 0.932515 | LION |
| 9 | LOSS | 0.150601 | LION |
| 10 | ACCURACY | 0.920245 | RMSPROP |
| 11 | LOSS | 0.181413 | RMSPROP |
##################################
# Consolidating the values for the
# accuracy metrics
# for all models
##################################
model_performance_comparison_accuracy = model_performance_comparison[model_performance_comparison['Metric']=='ACCURACY']
model_performance_comparison_accuracy.reset_index(inplace=True, drop=True)
model_performance_comparison_accuracy
|   | Metric | Value | Method |
|---|---|---|---|
| 0 | ACCURACY | 0.926380 | SGD |
| 1 | ACCURACY | 0.914110 | ADAM |
| 2 | ACCURACY | 0.926380 | ADAGRAD |
| 3 | ACCURACY | 0.926380 | ADADELTA |
| 4 | ACCURACY | 0.932515 | LION |
| 5 | ACCURACY | 0.920245 | RMSPROP |
##################################
# Plotting the values for the
# accuracy metrics
# for all models
##################################
fig, ax = plt.subplots(figsize=(7, 7))
accuracy_hbar = ax.barh(model_performance_comparison_accuracy['Method'], model_performance_comparison_accuracy['Value'])
ax.set_xlabel("Accuracy")
ax.set_ylabel("Neural Network Classification Models")
ax.bar_label(accuracy_hbar, fmt='%.5f', padding=-50, color='white', fontweight='bold')
ax.set_xlim(0,1)
plt.show()
##################################
# Consolidating the values for the
# logarithmic loss error metrics
# for all models
##################################
model_performance_comparison_loss = model_performance_comparison[model_performance_comparison['Metric']=='LOSS']
model_performance_comparison_loss.reset_index(inplace=True, drop=True)
model_performance_comparison_loss
|   | Metric | Value | Method |
|---|---|---|---|
| 0 | LOSS | 0.187134 | SGD |
| 1 | LOSS | 0.176633 | ADAM |
| 2 | LOSS | 0.172113 | ADAGRAD |
| 3 | LOSS | 0.158312 | ADADELTA |
| 4 | LOSS | 0.150601 | LION |
| 5 | LOSS | 0.181413 | RMSPROP |
##################################
# Plotting the values for the
# loss error
# for all models
##################################
fig, ax = plt.subplots(figsize=(7, 7))
loss_hbar = ax.barh(model_performance_comparison_loss['Method'], model_performance_comparison_loss['Value'])
ax.set_xlabel("Loss Error")
ax.set_ylabel("Neural Network Classification Models")
ax.bar_label(loss_hbar, fmt='%.5f', padding=-50, color='white', fontweight='bold')
ax.set_xlim(0,0.20)
plt.show()
2. Summary ¶
3. References ¶
- [Book] Deep Learning: A Visual Approach by Andrew Glassner
- [Book] Deep Learning with Python by François Chollet
- [Book] The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman
- [Book] Data Preparation for Machine Learning: Data Cleaning, Feature Selection, and Data Transforms in Python by Jason Brownlee
- [Book] Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson
- [Book] Feature Engineering for Machine Learning by Alice Zheng and Amanda Casari
- [Book] Applied Predictive Modeling by Max Kuhn and Kjell Johnson
- [Book] Data Mining: Practical Machine Learning Tools and Techniques by Ian Witten, Eibe Frank, Mark Hall and Christopher Pal
- [Book] Data Cleaning by Ihab Ilyas and Xu Chu
- [Book] Data Wrangling with Python by Jacqueline Kazil and Katharine Jarmul
- [Book] Regression Modeling Strategies by Frank Harrell
- [Python Library API] NumPy by NumPy Team
- [Python Library API] pandas by Pandas Team
- [Python Library API] seaborn by Seaborn Team
- [Python Library API] matplotlib.pyplot by MatPlotLib Team
- [Python Library API] itertools by Python Team
- [Python Library API] operator by Python Team
- [Python Library API] sklearn.experimental by Scikit-Learn Team
- [Python Library API] sklearn.impute by Scikit-Learn Team
- [Python Library API] sklearn.linear_model by Scikit-Learn Team
- [Python Library API] sklearn.preprocessing by Scikit-Learn Team
- [Python Library API] scipy by SciPy Team
- [Article] Step-by-Step Exploratory Data Analysis (EDA) using Python by Malamahadevan Mahadevan (Analytics Vidhya)
- [Article] Exploratory Data Analysis in Python — A Step-by-Step Process by Andrea D'Agostino (Towards Data Science)
- [Article] Exploratory Data Analysis with Python by Douglas Rocha (Medium)
- [Article] 4 Ways to Automate Exploratory Data Analysis (EDA) in Python by Abdishakur Hassan (BuiltIn)
- [Article] 10 Things To Do When Conducting Your Exploratory Data Analysis (EDA) by Alifia Harmadi (Medium)
- [Article] How to Handle Missing Data with Python by Jason Brownlee (Machine Learning Mastery)
- [Article] Statistical Imputation for Missing Values in Machine Learning by Jason Brownlee (Machine Learning Mastery)
- [Article] Imputing Missing Data with Simple and Advanced Techniques by Idil Ismiguzel (Towards Data Science)
- [Article] Missing Data Imputation Approaches | How to handle missing values in Python by Selva Prabhakaran (Machine Learning +)
- [Article] Master The Skills Of Missing Data Imputation Techniques In Python(2022) And Be Successful by Mrinal Walia (Analytics Vidhya)
- [Article] How to Preprocess Data in Python by Afroz Chakure (BuiltIn)
- [Article] Easy Guide To Data Preprocessing In Python by Ahmad Anis (KDNuggets)
- [Article] Data Preprocessing in Python by Tarun Gupta (Towards Data Science)
- [Article] Data Preprocessing using Python by Suneet Jain (Medium)
- [Article] Data Preprocessing in Python by Abonia Sojasingarayar (Medium)
- [Article] Data Preprocessing in Python by Afroz Chakure (Medium)
- [Article] Detecting and Treating Outliers | Treating the Odd One Out! by Harika Bonthu (Analytics Vidhya)
- [Article] Outlier Treatment with Python by Sangita Yemulwar (Analytics Vidhya)
- [Article] A Guide to Outlier Detection in Python by Sadrach Pierre (BuiltIn)
- [Article] How To Find Outliers in Data Using Python (and How To Handle Them) by Eric Kleppen (Career Foundry)
- [Article] Statistics in Python — Collinearity and Multicollinearity by Wei-Meng Lee (Towards Data Science)
- [Article] Understanding Multicollinearity and How to Detect it in Python by Terence Shin (Towards Data Science)
- [Article] A Python Library to Remove Collinearity by Gianluca Malato (Your Data Teacher)
- [Article] 8 Best Data Transformation in Pandas by Tirendaz AI (Medium)
- [Article] Data Transformation Techniques with Python: Elevate Your Data Game! by Siddharth Verma (Medium)
- [Article] Data Scaling with Python by Benjamin Obi Tayo (KDNuggets)
- [Article] How to Use StandardScaler and MinMaxScaler Transforms in Python by Jason Brownlee (Machine Learning Mastery)
- [Article] Feature Engineering: Scaling, Normalization, and Standardization by Aniruddha Bhandari (Analytics Vidhya)
- [Article] How to Normalize Data Using scikit-learn in Python by Jayant Verma (Digital Ocean)
- [Article] What are Categorical Data Encoding Methods | Binary Encoding by Shipra Saxena (Analytics Vidhya)
- [Article] Guide to Encoding Categorical Values in Python by Chris Moffitt (Practical Business Python)
- [Article] Categorical Data Encoding Techniques in Python: A Complete Guide by Soumen Atta (Medium)
- [Article] Categorical Feature Encoding Techniques by Tara Boyle (Medium)
- [Article] Ordinal and One-Hot Encodings for Categorical Data by Jason Brownlee (Machine Learning Mastery)
- [Article] Hypothesis Testing with Python: Step by Step Hands-On Tutorial with Practical Examples by Ece Işık Polat (Towards Data Science)
- [Article] 17 Statistical Hypothesis Tests in Python (Cheat Sheet) by Jason Brownlee (Machine Learning Mastery)
- [Article] A Step-by-Step Guide to Hypothesis Testing in Python using Scipy by Gabriel Rennó (Medium)
- [Article] How Does Backpropagation in a Neural Network Work? by Anas Al-Masri (Builtin)
- [Article] A Step by Step Backpropagation Example by Matt Mazur (MattMazur.Com)
- [Article] Understanding Backpropagation by Brent Scarff (Towards Data Science)
- [Article] Understanding Backpropagation Algorithm by Simeon Kostadinov (Towards Data Science)
- [Article] A Comprehensive Guide to the Backpropagation Algorithm in Neural Networks by Ahmed Gad (Neptune.AI)
- [Article] Backpropagation by John McGonagle, George Shaikouski and Christopher Williams (Brilliant)
- [Article] Backpropagation in Neural Networks by Inna Logunova (Serokell.IO)
- [Article] Backpropagation Concept Explained in 5 Levels of Difficulty by Devashish Sood (Medium)
- [Article] BackProp Explainer by Donny Bertucci (GitHub)
- [Article] Backpropagation Algorithm in Neural Network and Machine Learning by Intellipaat Team
- [Article] Understanding Backpropagation in Neural Networks by Tech-AI-Math Team
- [Article] Backpropagation Neural Network using Python by Avinash Navlani (Machine Learning Geek)
- [Article] Back Propagation in Neural Network: Machine Learning Algorithm by Daniel Johnson (Guru99)
- [Article] What is Backpropagation? by Thomas Wood (DeepAI.Org)
- [Article] Activation Functions in Neural Networks [12 Types & Use Cases] by Pragati Baheti (V7.Com)
- [Article] Activation Functions in Neural Networks by Sagar Sharma (Towards Data Science)
- [Article] Comparison of Sigmoid, Tanh and ReLU Activation Functions by Sandeep Kumar (AItude.Com)
- [Article] How to Choose an Activation Function for Deep Learning by Jason Brownlee (Machine Learning Mastery)
- [Article] Choosing the Right Activation Function in Deep Learning: A Practical Overview and Comparison by Okan Yenigün (Medium)
- [Article] Activation Functions in Neural Networks by Geeks For Geeks Team
- [Article] A Practical Comparison of Activation Functions by Danny Denenberg (Medium)
- [Article] Activation Functions in Neural Networks: With 15 examples by Nikolaj Buhl (Encord.Com)
- [Article] Activation functions used in Neural Networks - Which is Better? by Anish Singh Walia (Medium)
- [Article] 6 Types of Activation Function in Neural Networks You Need to Know by Kechit Goyal (UpGrad.Com)
- [Article] Activation Functions in Neural Networks by SuperAnnotate Team
- [Article] Compare Activation Layers by MathWorks Team
- [Article] Activation Functions In Neural Networks by Kurtis Pykes (Comet.Com)
- [Article] ReLU vs. Sigmoid Function in Deep Neural Networks by Ayush Thakur (Wanb.AI)
- [Article] Using Activation Functions in Neural Networks by Jason Brownlee (Machine Learning Mastery)
- [Article] Activation Function: Top 9 Most Popular Explained & When To Use Them by Neri Van Otten (SpotIntelligence.Com)
- [Article] 5 Deep Learning and Neural Network Activation Functions to Know by Artem Oppermann (BuiltIn.Com)
- [Article] Activation Functions in Deep Learning: Sigmoid, tanh, ReLU by Artem Oppermann
- [Article] 7 Types of Activation Functions in Neural Network by Dinesh Kumawat (AnalyticsSteps.Com)
- [Article] What is an Activation Function? A Complete Guide by Petru Potrimba (RoboFlow.Com)
- [Article] Various Optimization Algorithms For Training Neural Network by Sanket Doshi (Towards Data Science)
- [Article] Optimization Algorithms in Neural Networks by Nagesh Singh Chauhan (KDNuggets)
- [Article] A Comprehensive Guide on Optimizers in Deep Learning by Ayush Gupta (Analytics Vidhya)
- [Article] How to Manually Optimize Neural Network Models by Jason Brownlee (Machine Learning Mastery)
- [Article] How to Choose an Optimization Algorithm by Jason Brownlee (Machine Learning Mastery)
- [Article] Types of Optimization Algorithms used in Neural Networks and Ways to Optimize Gradient Descent by Anish Singh Walia (Medium)
- [Article] Optimizing Neural Networks: Strategies and Techniques by Kajeeth Kumar (AI.PlainEnglish.IO)
- [Article] Neural Network Optimization by Matthew Stewart (Towards Data Science)
- [Article] Optimizers in Deep Learning by Cathrine Jeeva (Scaler)
- [Article] Understanding Deep Learning Optimizers: Momentum, AdaGrad, RMSProp & Adam by Vyacheslav Efimov (Towards Data Science)
- [Article] Types of Optimizers in Deep Learning Every AI Engineer Should Know by Pavan Vadapalli (UpGrad.Com)
- [Article] Optimizers by ML-Cheatsheet Team
- [Article] Optimizers by Edge Info Team
- [Article] Navigating Neural Network Optimization: A Comprehensive Guide to Types of Optimizers by Muhammad Zain Tariq (Medium)
- [Article] An Introduction to Artificial Neural Network Optimizers by Pradeep Natarajan
- [Article] Optimizers Explained for Training Neural Networks by Kartik Chaudhary (DropsOfAI.Com)
- [Article] Optimizers in Neural Networks by AI Ensured Team
- [Article] Parameter Optimization in Neural Networks by DeepLearning.AI Team
- [Article] Overview of Different Optimizers for Neural Networks by Renu Khandelwal (Medium)
- [Article] Which Optimizer should I use for my ML Project? by Lightly.AI Team
- [Article] A Journey into Optimization Algorithms for Deep Neural Networks by Sergios Karagiannakos (AISummer.Com)
- [Article] How to Compare Keras Optimizers in Tensorflow for Deep Learning by Saurav Maheshkar (WAndB.Com)
- [Article] An Empirical Comparison of Optimizers for Machine Learning Models by Rick Wierenga (Medium)
- [Publication] Data Quality for Machine Learning Tasks by Nitin Gupta, Shashank Mujumdar, Hima Patel, Satoshi Masuda, Naveen Panwar, Sambaran Bandyopadhyay, Sameep Mehta, Shanmukha Guttula, Shazia Afzal, Ruhi Sharma Mittal and Vitobha Munigala (KDD ’21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining)
- [Publication] Overview and Importance of Data Quality for Machine Learning Tasks by Abhinav Jain, Hima Patel, Lokesh Nagalapatti, Nitin Gupta, Sameep Mehta, Shanmukha Guttula, Shashank Mujumdar, Shazia Afzal, Ruhi Sharma Mittal and Vitobha Munigala (KDD ’20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining)
- [Publication] Multiple Imputation of Discrete and Continuous Data by Fully Conditional Specification by Stef van Buuren (Statistical Methods in Medical Research)
- [Publication] Mathematical Contributions to the Theory of Evolution: Regression, Heredity and Panmixia by Karl Pearson (Royal Society)
- [Publication] A New Family of Power Transformations to Improve Normality or Symmetry by In-Kwon Yeo and Richard Johnson (Biometrika)
- [Course] IBM Data Analyst Professional Certificate by IBM Team (Coursera)
- [Course] IBM Data Science Professional Certificate by IBM Team (Coursera)
- [Course] IBM Machine Learning Professional Certificate by IBM Team (Coursera)
- [Course] Machine Learning Specialization Certificate by DeepLearning.AI Team (Coursera)
from IPython.display import display, HTML
display(HTML("<style>.rendered_html { font-size: 15px; font-family: 'Trebuchet MS'; }</style>"))