Table of Contents¶
Context:¶
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
# Necessary Imports
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
import statistics as stats
import scipy.stats as spstats
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True) # adds a nice background to the graphs
%matplotlib inline
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Libraries to build ensemble classifiers
from sklearn.ensemble import (
BaggingClassifier,
RandomForestClassifier,
GradientBoostingClassifier,
AdaBoostClassifier,
StackingClassifier,
)
# To tune different models
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn import metrics
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
#plot_confusion_matrix,
make_scorer,
precision_recall_curve,
roc_curve,
)
# To impute missing values
from sklearn.impute import SimpleImputer,KNNImputer
# To undersample and oversample the data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline, make_pipeline
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from xgboost import XGBClassifier
# import data
data_train = pd.read_csv("train.csv")
df=data_train.copy()
# display data head
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# import data
data_test = pd.read_csv("test.csv")
# display data head
data_test.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
#get the shape of the data
df.shape
(891, 12)
# Use info() to print a summary of the DataFrame
df.info()
RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
# Check for columns that have missing values
df.isnull().sum().sort_values(ascending=False)
Cabin 687 Age 177 Embarked 2 PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 SibSp 0 Parch 0 Ticket 0 Fare 0 dtype: int64
Insights¶
- There are missing values in Age, Cabin, and Embarked. We will need to impute these missing values in some way.
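As a quick check before choosing an imputation strategy, the missing counts above can be read as percentages; a small sketch using the df loaded earlier:
# percentage of missing values per column
missing_pct = df.isnull().mean().mul(100).round(2).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])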
# Converting the object columns to "category"
cat_columns = df.select_dtypes(include=["object", "category"]).columns.tolist()
for col in cat_columns:
df[col] = df[col].astype("category")
# Use info() to print a summary of the DataFrame
df.info()
RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null category 4 Sex 891 non-null category 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null category 9 Fare 891 non-null float64 10 Cabin 204 non-null category 11 Embarked 889 non-null category dtypes: category(5), float64(2), int64(5) memory usage: 122.0 KB
# Drop the PassengerId column as it is not needed
df.drop("PassengerId", axis=1, inplace=True)
# Statistical summary of the data
df.describe(include="all").T
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
Survived | 891.0 | NaN | NaN | NaN | 0.383838 | 0.486592 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
Pclass | 891.0 | NaN | NaN | NaN | 2.308642 | 0.836071 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
Name | 891 | 891 | Abbing, Mr. Anthony | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Sex | 891 | 2 | male | 577 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Age | 714.0 | NaN | NaN | NaN | 29.699118 | 14.526497 | 0.42 | 20.125 | 28.0 | 38.0 | 80.0 |
SibSp | 891.0 | NaN | NaN | NaN | 0.523008 | 1.102743 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
Parch | 891.0 | NaN | NaN | NaN | 0.381594 | 0.806057 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
Ticket | 891 | 681 | 1601 | 7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Fare | 891.0 | NaN | NaN | NaN | 32.204208 | 49.693429 | 0.0 | 7.9104 | 14.4542 | 31.0 | 512.3292 |
Cabin | 204 | 147 | C23 C25 C27 | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Embarked | 889 | 3 | S | 644 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# declares the DrawHist function
def DrawHist(data, feature, save_image=False):
"""Takes in the dataframe and feature you want to get a histogram of."""
# creates histplot for the feature
plt.figure(figsize=(15, 3))
sns.histplot(data=data, x=feature, kde=True)
# plots a green line for the mean
plt.axvline(x=data[feature].mean(), c="green", label="mean")
# plots a red line for the median
plt.axvline(x=data[feature].median(), c="red", label="median")
# Adds a legend
plt.legend(ncol=1, loc="upper right", frameon=True)
# Adds a title to the histogram
plt.title("Histogram: " + feature, fontdict={"fontsize": 20}, pad=10)
# save to image
if save_image:
plt.savefig(
"Histogram_" + feature + ".jpg", bbox_inches="tight", pad_inches=0.5
)
# declares the DrawBox function
def DrawBox(data, feature, save_image=False):
"""Takes in the dataframe and feature you want to get a boxplot of."""
# creates boxplot
plt.figure(figsize=(15, 1))
sns.boxplot(data=data, x=feature)
# adds a title to the plot
plt.title("Box Plot: " + feature, fontdict={"fontsize": 20}, pad=10)
# save to image
if save_image:
plt.savefig("BoxPlot_" + feature + ".jpg", bbox_inches="tight", pad_inches=0.5)
# declares the DrawCountPlot function
def DrawCountPlot(data, feature, num=10, save_image=False):
"""Takes in a dataframe, feature, and number of items to display on the x axis."""
# creates a count plot
plt.figure(figsize=(15, 3)) # To resize the plot
sns.countplot(
data=data, x=feature, order=data[feature].value_counts().iloc[:num].index
)
# rotates the ticks on the x axis
plt.xticks(rotation=90)
# adds a title to the plot
plt.title("Count Plot: " + feature, fontdict={"fontsize": 20}, pad=10)
# save to image
if save_image:
plt.savefig(
"CountPlot_" + feature + ".jpg", bbox_inches="tight", pad_inches=0.5
)
plt.show()
# declares the DrawStackedBarPlot function
def DrawStackedBarPlot(data, feature_x, feature_y, max_elements=10, save_image=False):
num_elements = len(data[feature_x].unique())
if (num_elements <= max_elements) and (feature_x!=feature_y):
# stacked bar plot
tab = pd.crosstab(data[feature_x], data[feature_y], normalize="index").sort_values(
by=data[feature_y].value_counts().index[-1], ascending=True
)
tab.plot(kind="barh", stacked=True, figsize=(20, num_elements / 2))
# adds legend to plot
plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
# adds title to plot
plt.title(feature_x + " vs " + feature_y, fontdict={"fontsize": 25}, pad=10)
# save to image
if save_image:
plt.savefig(
"Stacked_Bar_Plot" + "_" + feature_y + "_" + feature_x + ".jpg",
bbox_inches="tight",
pad_inches=0.5,
)
# defines a function to perform a univariate analysis using histplot,
# and boxplot for numerical features, and a countplot for categorical features
def Univariate_Analysis(
data, features, num=15, save_image=False, display_categorical=True
):
# creates a copy of the data
cData = data.copy()
for col in cData[features]:
if cData[col].dtypes == "float64" or cData[col].dtypes == "int64":
# creates histplot of feature
DrawHist(data=cData, feature=col, save_image=save_image)
# creates boxplot of feature
DrawBox(data=cData, feature=col, save_image=save_image)
else:
if display_categorical:
# creates countplot of feature
DrawCountPlot(data=cData, feature=col, num=num, save_image=save_image)
# Perform Univariate Analysis
Univariate_Analysis(data=df, features=df.columns, num=25)
print(round(len(df[df['Survived']==0])/len(df)*100,2),"% of passengers died in the training data.")
61.62 % of passengers died in the training data.
Insights¶
- More passengers died than survived.
- There were more passengers in 3rd class than in 1st and 2nd combined.
- About twice as many passengers were male.
- The mean age was about 30 years; Age has a slight right tail.
- SibSp is quite skewed, with a long right tail. Some outliers appear to be present.
- Parch also has a long right tail.
- The mean and median fare are quite low, with a few very high outliers.
- The vast majority embarked from S.
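The tail and skew observations above can be quantified with pandas' skew(); a minimal sketch on the numerical columns mentioned (positive values indicate a right tail):
# skewness of the numerical features discussed above
print(df[["Age", "SibSp", "Parch", "Fare"]].skew())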
# heat map of numerical features.
fig, ax = plt.subplots(figsize=(20, 5))
sns.heatmap(
df.corr(), ax=ax, annot=True, linewidths=0.05, fmt=".2f", cmap="magma"
) # the color intensity reflects the correlation strength
# plt.savefig('HeatMap.jpg',bbox_inches ="tight",pad_inches = 0.5)
plt.show()
Insights¶
- The strongest correlation with Survived is with Pclass (-0.34).
- Survived has a positive correlation with Fare (0.26).
- The strongest positive correlation is between Parch and SibSp (0.41).
- There is a strong negative correlation between Pclass and Fare (-0.55).
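To read the heatmap numerically, the correlations with Survived can be pulled out and sorted by strength; a small sketch (depending on the pandas version, numeric_only=True may be required by corr()):
# correlation of each numeric feature with Survived, strongest first
corr_target = df.corr(numeric_only=True)["Survived"].drop("Survived")
print(corr_target.reindex(corr_target.abs().sort_values(ascending=False).index))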
# Define a function to perform a Bivariate analysis with a chosen feature across all other categorical features.
# Then creates a pairplots with the chosen feature across all numerical features.
def Bivariate_Analysis(data, feature, max_elements=10, save_image=False):
plt.figure(figsize=(20, 5))
# Box plot of categorical variables against the chosen feature
# push all categorical features into a dataframe
categorical_cols = data.select_dtypes(include=["category", "object"]).columns
for col in categorical_cols:
# Draw a stacked bar plot
DrawStackedBarPlot(
data=data,
feature_x=col,
feature_y=feature,
max_elements=max_elements,
save_image=save_image,
)
    # Pair plot of numerical variables against the chosen feature
    # push all numerical features into a list
    numerical_cols = data.select_dtypes(include=["int", "float"]).columns
    # plot the numerical columns in groups of 5 so that none are skipped
    for n in range(0, len(numerical_cols), 5):
        sns.pairplot(
            data=data, x_vars=numerical_cols[n : n + 5], y_vars=feature, kind="reg"
        )
        # plt.title('Pair Plot: ' + feature + ' vs Numerical Columns', fontdict={'fontsize': 20}, pad=10)
# save to image for the pairplot
if save_image:
plt.savefig(
"PairPlot" + "_" + feature + ".jpg", bbox_inches="tight", pad_inches=0.5
)
# performs a bivariate analysis
Bivariate_Analysis(data=df, feature="Survived", max_elements=35)
Insights¶
- Far more women survived than men.
- More passengers survived if they embarked from C.
- More 1st class passengers survived than 2nd or 3rd class.
- As age increased, so did the rate of death.
- As Parch increases, so does the rate of survival.
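The gender and class effects above can be verified with simple group means; a minimal sketch on the training frame:
# survival rate by sex and by passenger class
print(df.groupby("Sex")["Survived"].mean())
print(df.groupby("Pclass")["Survived"].mean())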
Key Takeaways¶
- The passengers with the greatest likelihood of survival are 1st class young women.
- I feel the most important drivers of survival are Pclass and gender.
- Remove the Name feature.
- Convert the Cabin feature to a Deck feature (e.g., Deck A through Deck G).
- Impute missing values.
- Bin Ticket somehow; may just remove it for now.
- Create dummies for Sex, Embarked, and Deck.
#make copy of dataframe
df_processed=df.copy()
#Drop features
df_processed.drop(['Name','Ticket'],axis=1,inplace=True)
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | NaN | S |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | NaN | S |
# Convert Cabin to Deck (first character of the cabin code; NaN stays NaN)
df_processed['Deck'] = df_processed['Cabin'].str[:1]
#Remove Cabin
df_processed.drop(['Cabin'],axis=1,inplace=True)
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | NaN |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | NaN |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | NaN |
df_processed['Deck'].value_counts()
C 59 B 47 D 33 E 32 A 15 F 13 G 4 T 1 Name: Deck, dtype: int64
#define function to replace values in a feature
def replace_with(data, features=[], replace_struc=[], revert=False):
    # do the replacing for each listed feature
    for feature in features:
        replacement = replace_struc[features.index(feature)]
        if revert:  # invert the mapping to restore the original labels
            replacement = {v: k for k, v in replacement.items()}
        data[feature].replace(replacement, inplace=True)
    return data
# setup a dictionary to do the replacing
Sex = {"male": 0, "female": 1}
Embarked = {"C": 1,"Q": 2,"S": 3}
Deck = {"T": 0,"A": 1,"B": 2,"C": 3,"D": 4,"E": 5,"F": 6,"G": 7}
#use replace_with on the features that are not numerical
df_processed = replace_with(df_processed,['Sex','Embarked','Deck'],[Sex,Embarked,Deck])
# display the head
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 | 3 | NaN |
1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 3.0 |
2 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 | 3 | NaN |
3 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 3 | 3.0 |
4 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 | 3 | NaN |
# Impute missing values with a KNN imputer
imputer = KNNImputer(n_neighbors=5)
col_to_impute = ['Embarked', 'Sex', 'Age', 'Deck']
df_processed[col_to_impute] = imputer.fit_transform(df_processed[col_to_impute])
# round the imputed values back to whole-number codes
df_processed[col_to_impute] = round(df_processed[col_to_impute])
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0.0 | 22.0 | 1 | 0 | 7.2500 | 3.0 | 2.0 |
1 | 1 | 1 | 1.0 | 38.0 | 1 | 0 | 71.2833 | 1.0 | 3.0 |
2 | 1 | 3 | 1.0 | 26.0 | 0 | 0 | 7.9250 | 3.0 | 4.0 |
3 | 1 | 1 | 1.0 | 35.0 | 1 | 0 | 53.1000 | 3.0 | 3.0 |
4 | 0 | 3 | 0.0 | 35.0 | 0 | 0 | 8.0500 | 3.0 | 3.0 |
#reverse the replace_with
df_processed = replace_with(df_processed,['Sex','Embarked','Deck'],[Sex,Embarked,Deck],revert=True)
# display the head
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | B |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | D |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | C |
# performs a bivariate analysis
Bivariate_Analysis(data=df_processed, feature="Survived", max_elements=35)
Insights¶
- Female passengers are still far more likely to survive than male passengers.
- Passengers that embarked from C were more likely to survive.
- Passengers on deck E were more likely to survive. If we discount deck T (Tank Top), passengers from deck B saw the most deaths.
- Comparing decks A, B, C with decks D, E, F, one could say a passenger was more likely to survive if they stayed on the lower decks as opposed to the upper decks.
- Pclass has a strong negative relationship with survival.
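The deck observations above can be checked with a group mean before binning; a small sketch on df_processed:
# survival rate by deck, highest first
print(df_processed.groupby("Deck")["Survived"].mean().sort_values(ascending=False))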
#Bin decks into upper and lower decks
#Upper decks will be defined as decks A,B,C
#Lower decks will be defined as decks D,E,F,G
# setup a dictionary to do the replacing
DeckLevel = {"T": "upper","A": "upper","B": "upper","C": "upper","D": "lower","E": "lower","F": "lower","G": "lower"}
df_processed['DeckLevel'] = df_processed['Deck']
#use replace_with on the features that are not numerical
df_processed = replace_with(df_processed,['DeckLevel'],[DeckLevel])
# display the head
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | DeckLevel | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | B | upper |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C | upper |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | D | lower |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C | upper |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | C | upper |
# performs a bivariate analysis
Bivariate_Analysis(data=df_processed, feature="Survived", max_elements=35)
Insights¶
- As expected, this shows that the lower decks had a greater likelihood of survival.
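The same group-mean check works for the binned feature; a minimal sketch:
# survival rate for the two deck levels
print(df_processed.groupby("DeckLevel")["Survived"].mean())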
# defines a function to get the outlier powers
def Get_Outlier_Powers(data, feature, limits=[0.25, 0.75], print_quartiles=True):
# gets the quartiles of the feature based on the limits set
quartiles = np.quantile(data[feature][data[feature].notnull()], limits)
# calculates the power_4iqr
power_4iqr = 4 * (quartiles[1] - quartiles[0])
# print what Q1, Q3, and 4*IQR are if desired
if print_quartiles:
print(
"Q1 =", quartiles[0], "Q3 =", quartiles[1], "4*IQR =", round(power_4iqr, 2)
)
# calculates the outlier powers
outlier_powers = data.loc[
np.abs(data[feature] - data[feature].median()) > power_4iqr, feature
]
return outlier_powers
# Define a function to detect outliers in numerical columns and display a boxplot
def Detect_Outliers(data, features, limits=[0.25, 0.75], save_image=False):
cData = data.copy()
for col in cData[features]:
if cData[col].dtypes == "float64" or cData[col].dtypes == "int64":
# Find Quartiles
quartiles = np.quantile(cData[col][cData[col].notnull()], limits)
# calculate power_4iqr
power_4iqr = 4 * (quartiles[1] - quartiles[0])
# Draw BoxPlot
plt.figure(figsize=(15, 1))
sns.boxplot(data=cData, x=col)
plt.title(
col
+ " | "
+ " Q1 = "
+ str(quartiles[0])
+ " | Q3 = "
+ str(quartiles[1])
+ " | 4*IQR = "
+ str(round(power_4iqr, 2)),
fontdict={"fontsize": 20},
pad=10,
)
            # save to image
            if save_image:
                plt.savefig(
                    "BoxPlot_" + col + ".jpg", bbox_inches="tight", pad_inches=0.5
                )
# define a function that will remove the outliers of the chosen features
# based on the set limits and their outlier powers
def Drop_Outliers(data, features, limits=[0.25, 0.75]):
# make of copy data
cData = data.copy()
for col in cData[features]:
# get outlier_powers
outlier_powers = Get_Outlier_Powers(
cData, col, limits=limits, print_quartiles=False
)
# drop the outliers from the data
cData.drop(outlier_powers.index, axis=0, inplace=True)
return cData
# Run Detect_Outliers
Detect_Outliers(data=df_processed, features=df_processed.columns, limits=[0.25, 0.75])
# # Run Drop_Outliers on df_processed_no_cat and store in df_processed_outliers_dropped
# df_processed_outliers_dropped = Drop_Outliers(
# data=df_processed,
# features=["Fare", "Parch", "SibSp"],
# limits=[0.1, 0.9],
# )
# # run Detect_Outliers again to get a view of the numerical features after dropping outliers
# Detect_Outliers(
# data=df_processed_outliers_dropped, features=df_processed_outliers_dropped.columns
# )
# # push a copy of df_processed_outliers_dropped into df_cleaned
# df_cleaned = df_processed_outliers_dropped.copy()
# df_processed=df_cleaned
# Drop features
df_processed2 = df_processed.copy()
df_processed2.drop(['Deck','DeckLevel'],axis=1,inplace=True)
df_modeling=df_processed2.copy()
# Separating features and the target column
X = df_modeling.drop("Survived", axis=1)
Y = df_modeling["Survived"]
# create dummies
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True
)
# display head
X.head()
Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|
0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
2 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
3 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
4 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 1 |
# Splitting the data into train and validation sets in an 80:20 ratio
x_train, x_val, y_train, y_val = train_test_split(
X, Y, test_size=0.2, random_state=1, shuffle=True, stratify=Y
)
# get shape of training and test set
x_train.shape, x_val.shape
((712, 8), (179, 8))
# Print percentage of training set
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
# Print percentage of validation set
print("Percentage of classes in validation set:")
print(y_val.value_counts(normalize=True))
# setting the class weight
class_weighting = {
0: round(y_train.value_counts(normalize=True)[1], 2),
1: round(y_train.value_counts(normalize=True)[0], 2),
}
print("*" * 50)
print("Class Weighting:", class_weighting)
Percentage of classes in training set: 0 0.616573 1 0.383427 Name: Survived, dtype: float64 Percentage of classes in validation set: 0 0.614525 1 0.385475 Name: Survived, dtype: float64 ************************************************** Class Weighting: {0: 0.38, 1: 0.62}
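The hand-rolled weighting above gives each class the other class's prevalence. For comparison, scikit-learn can compute inverse-frequency weights directly; a minimal sketch (the compute_class_weight import is an addition, and its scale differs from the dictionary above even though the intent is the same):
from sklearn.utils.class_weight import compute_class_weight
# "balanced" weights = n_samples / (n_classes * class_counts)
balanced = compute_class_weight(class_weight="balanced", classes=np.unique(y_train), y=y_train)
print(dict(zip(np.unique(y_train), balanced.round(2))))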
Original Models¶
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
#define function to build all models and display results of the chosen scorer
def BuildModels(models, x_train, y_train, score='recall'):
    '''Builds all models and displays k-fold cross-validation and validation-set results for the chosen scorer.'''
    #catch if no model defined then display message and return
    if len(models) == 0:
        print('please pass in at least 1 model.')
return
# necessary imports if not already done
import sklearn.metrics as metrics
from sklearn.metrics import (
recall_score,
accuracy_score,
precision_score,
f1_score,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score
#define scorer
if(score=='recall'): scorer = metrics.make_scorer(metrics.recall_score)
if(score=='accuracy'): scorer = metrics.make_scorer(metrics.accuracy_score)
if(score=='precision'): scorer = metrics.make_scorer(metrics.precision_score)
if(score=='f1'): scorer = metrics.make_scorer(metrics.f1_score)
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
fitted_models=[] # Empty list to store all fitted models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Cost:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=x_train, y=y_train, scoring=scorer, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
# loop through all models to get the validation performance (note: scores against the global x_val/y_val split)
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(x_train, y_train)
scores = recall_score(y_val, model.predict(x_val))
print("{}: {}".format(name, scores))
fitted_models.append(model)
return fitted_models
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.accuracy_score)
#define all model we will be building
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("XGB", XGBClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
#build models using original data
models_original=BuildModels(models,x_train=x_train,y_train=y_train,score='accuracy')
Cross-Validation Cost: Logistic Regression: 0.8047572146163695 Decision Tree: 0.7612035851472471 Random Forest: 0.796395154141633 Bagging: 0.7935684034275583 GBM: 0.825824879346006 XGB: 0.7892839554811386 Adaboost: 0.8047670639219936 Validation Performance: Logistic Regression: 0.6956521739130435 Decision Tree: 0.6956521739130435 Random Forest: 0.7536231884057971 Bagging: 0.7101449275362319 GBM: 0.7681159420289855 XGB: 0.7971014492753623 Adaboost: 0.6666666666666666
Insights¶
- The best model under Cross-Validation Cost is GBM.
- The best model under Validation Performance is XGB.
#Display the validation performance across all models using the original data.
for model in models_original:
name=type(model).__name__
mod_pref=model_performance_classification_sklearn(model=model,predictors=x_val,target=y_val)
print(name,"\n",mod_pref,"\n")
LogisticRegression Accuracy Recall Precision F1 0 0.782123 0.695652 0.727273 0.711111 DecisionTreeClassifier Accuracy Recall Precision F1 0 0.793296 0.695652 0.75 0.721805 RandomForestClassifier Accuracy Recall Precision F1 0 0.832402 0.753623 0.8 0.776119 BaggingClassifier Accuracy Recall Precision F1 0 0.798883 0.710145 0.753846 0.731343 GradientBoostingClassifier Accuracy Recall Precision F1 0 0.849162 0.768116 0.828125 0.796992 XGBClassifier Accuracy Recall Precision F1 0 0.843575 0.797101 0.797101 0.797101 AdaBoostClassifier Accuracy Recall Precision F1 0 0.776536 0.666667 0.730159 0.69697
Insights¶
- Here GBM is better than XGB on accuracy, but only very slightly.
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
x_train_over, y_train_over = sm.fit_resample(x_train, y_train)
print("Before OverSampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After OverSampling, count of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, count of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(x_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, count of label '1': 273 Before OverSampling, count of label '0': 439 After OverSampling, count of label '1': 439 After OverSampling, count of label '0': 439 After OverSampling, the shape of train_X: (878, 8) After OverSampling, the shape of train_y: (878,)
#build models using oversample data
models_over=BuildModels(models,x_train=x_train_over,y_train=y_train_over,score='accuracy')
Cross-Validation Cost: Logistic Regression: 0.8177857142857142 Decision Tree: 0.8017922077922076 Random Forest: 0.8303246753246754 Bagging: 0.8246233766233766 GBM: 0.8314480519480518 XGB: 0.8360194805194805 Adaboost: 0.8257922077922079 Validation Performance: Logistic Regression: 0.7101449275362319 Decision Tree: 0.7681159420289855 Random Forest: 0.7391304347826086 Bagging: 0.7681159420289855 GBM: 0.7681159420289855 XGB: 0.7971014492753623 Adaboost: 0.7246376811594203
Insights¶
- When oversampled data is used, the XGB model is the best.
#Display the validation performance across all models using the oversample data.
for model in models_over:
name=type(model).__name__
mod_pref=model_performance_classification_sklearn(model=model,predictors=x_val,target=y_val)
print(name,"\n",mod_pref,"\n")
LogisticRegression Accuracy Recall Precision F1 0 0.776536 0.710145 0.710145 0.710145 DecisionTreeClassifier Accuracy Recall Precision F1 0 0.804469 0.768116 0.736111 0.751773 RandomForestClassifier Accuracy Recall Precision F1 0 0.804469 0.73913 0.75 0.744526 BaggingClassifier Accuracy Recall Precision F1 0 0.826816 0.768116 0.779412 0.773723 GradientBoostingClassifier Accuracy Recall Precision F1 0 0.826816 0.768116 0.779412 0.773723 XGBClassifier Accuracy Recall Precision F1 0 0.843575 0.797101 0.797101 0.797101 AdaBoostClassifier Accuracy Recall Precision F1 0 0.776536 0.724638 0.704225 0.714286
#Display the confusion matrix for the random forest model
rForest_over=RandomForestClassifier(random_state=1)
rForest_over.fit(x_train_over,y_train_over)
confusion_matrix_sklearn(model=rForest_over,predictors=x_val,target=y_val)
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
x_train_under, y_train_under = rus.fit_resample(x_train, y_train)
print("Before Under Sampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before Under Sampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, count of label '1': {}".format(sum(y_train_under == 1)))
print("After Under Sampling, count of label '0': {} \n".format(sum(y_train_under == 0)))
print("After Under Sampling, the shape of train_X: {}".format(x_train_under.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_under.shape))
Before Under Sampling, count of label '1': 273 Before Under Sampling, count of label '0': 439 After Under Sampling, count of label '1': 273 After Under Sampling, count of label '0': 273 After Under Sampling, the shape of train_X: (546, 8) After Under Sampling, the shape of train_y: (546,)
#build models using undersample data
models_under=BuildModels(models,x_train=x_train_under,y_train=y_train_under,score='accuracy')
Cross-Validation Cost: Logistic Regression: 0.8021351125938281 Decision Tree: 0.7452376980817348 Random Forest: 0.7801501251042535 Bagging: 0.7709924937447873 GBM: 0.7783319432860718 XGB: 0.7618181818181817 Adaboost: 0.7746622185154296 Validation Performance: Logistic Regression: 0.7391304347826086 Decision Tree: 0.7536231884057971 Random Forest: 0.782608695652174 Bagging: 0.782608695652174 GBM: 0.8115942028985508 XGB: 0.8260869565217391 Adaboost: 0.7391304347826086
Insights¶
- When undersampled data is used, XGB and GBM are about equal.
#Display the validation performance across all models using the undersample data.
for model in models_under:
name=type(model).__name__
mod_pref=model_performance_classification_sklearn(model=model,predictors=x_val,target=y_val)
print(name,"\n",mod_pref,"\n")
LogisticRegression Accuracy Recall Precision F1 0 0.776536 0.73913 0.69863 0.71831 DecisionTreeClassifier Accuracy Recall Precision F1 0 0.77095 0.753623 0.684211 0.717241 RandomForestClassifier Accuracy Recall Precision F1 0 0.798883 0.782609 0.72 0.75 BaggingClassifier Accuracy Recall Precision F1 0 0.826816 0.782609 0.771429 0.776978 GradientBoostingClassifier Accuracy Recall Precision F1 0 0.826816 0.811594 0.756757 0.783217 XGBClassifier Accuracy Recall Precision F1 0 0.765363 0.826087 0.655172 0.730769 AdaBoostClassifier Accuracy Recall Precision F1 0 0.77095 0.73913 0.689189 0.713287
#Display the confusion matrix for the random forest model
rForest_under=RandomForestClassifier(random_state=1)
rForest_under.fit(x_train_under,y_train_under)
confusion_matrix_sklearn(model=rForest_under,predictors=x_val,target=y_val)
RandomForestClassifier¶
%%time
# defining model
Model = RandomForestClassifier(random_state=1,class_weight=class_weighting)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": np.arange(10, 100,10),
"min_samples_leaf": np.arange(1, 8,1),
"max_samples": np.arange(0.2, 1, 0.1),
"max_features": np.arange(0.1, 1, 0.1),
"max_depth": np.arange(10, 30, 1)
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model, param_distributions=param_grid, n_iter=5, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(x_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 90, 'min_samples_leaf': 2, 'max_samples': 0.6000000000000001, 'max_features': 0.2, 'max_depth': 26} with CV score=0.8020191076529104: CPU times: user 159 ms, sys: 169 ms, total: 327 ms Wall time: 2.59 s
#Build Base Model
rForest_original= RandomForestClassifier(random_state=1,class_weight=class_weighting)
# Fit the model on training data
rForest_original.fit(x_train, y_train)
# Creating new model with best parameters
rForest_tuned_original = RandomForestClassifier(
n_estimators=90,
min_samples_leaf= 2,
max_samples= 0.6,
max_features= 0.2,
max_depth= 26
)
# Fit the model on training data
rForest_tuned_original.fit(x_train, y_train)
RandomForestClassifier(max_depth=26, max_features=0.2, max_samples=0.6, min_samples_leaf=2, n_estimators=90)
#display performance of base model on validation set
print('Original')
pref_rForest_original = model_performance_classification_sklearn(
model=rForest_original,predictors=x_val,target=y_val)
print(pref_rForest_original)
#display performance of tuned model on validation set
print('Tuned')
pref_rForest_original_tuned=model_performance_classification_sklearn(
model=rForest_tuned_original,predictors=x_val,target=y_val)
print(pref_rForest_original_tuned)
#display performance of tuned model on training set
print('Tuned Performance Training')
pref_rForest_original_tuned_train=model_performance_classification_sklearn(
model=rForest_tuned_original,predictors=x_train,target=y_train)
print(pref_rForest_original_tuned_train)
Original Accuracy Recall Precision F1 0 0.821229 0.73913 0.784615 0.761194 Tuned Accuracy Recall Precision F1 0 0.843575 0.768116 0.815385 0.791045 Tuned Performance Training Accuracy Recall Precision F1 0 0.907303 0.827839 0.922449 0.872587
GBM¶
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
param_grid = {
"n_estimators": np.arange(50,150,10), #100
"max_features":np.arange(0.5,1.5,0.5),
"max_depth":np.arange(1,10,1), #3
"max_leaf_nodes":np.arange(1,10,1),
"learning_rate": np.arange(0.05,0.5,0.05), #0.1
"subsample":np.arange(0.1,1.5,0.1),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model, param_distributions=param_grid, n_iter=5, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(x_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.8, 'n_estimators': 100, 'max_leaf_nodes': 7, 'max_features': 1.0, 'max_depth': 2, 'learning_rate': 0.15000000000000002} with CV score=0.8174825174825175: CPU times: user 113 ms, sys: 10.4 ms, total: 124 ms Wall time: 402 ms
#Build Base Model
GBM_original= GradientBoostingClassifier(random_state=1)
# Fit the model on training data
GBM_original.fit(x_train, y_train)
# Creating new model with best parameters
GBM_tuned_original = GradientBoostingClassifier(
subsample= 0.7,
n_estimators= 75,
max_leaf_nodes= 5,
max_features= 0.5,
max_depth= 3,
learning_rate= 0.1
)
# Fit the model on training data
GBM_tuned_original.fit(x_train, y_train)
GradientBoostingClassifier(max_features=0.5, max_leaf_nodes=5, n_estimators=75, subsample=0.7)
#display performance of base model on validation set
print('Original')
pref_GBM_original = model_performance_classification_sklearn(
model=GBM_original,predictors=x_val,target=y_val)
print(pref_GBM_original)
#display performance of tuned model on validation set
print('Tuned')
pref_GBM_original_tuned=model_performance_classification_sklearn(
model=GBM_tuned_original,predictors=x_val,target=y_val)
print(pref_GBM_original_tuned)
#display performance of tuned model on training set
print('Tuned Performance Training')
pref_GBM_original_tuned_train=model_performance_classification_sklearn(
model=GBM_tuned_original,predictors=x_train,target=y_train)
print(pref_GBM_original_tuned_train)
Original Accuracy Recall Precision F1 0 0.849162 0.768116 0.828125 0.796992 Tuned Accuracy Recall Precision F1 0 0.821229 0.710145 0.803279 0.753846 Tuned Performance Training Accuracy Recall Precision F1 0 0.869382 0.78022 0.865854 0.820809
XGB¶
%%time
# defining model
Model = XGBClassifier(random_state=1)
param_grid = {
"n_estimators": np.arange(10,150,10), #100
"max_depth":np.arange(1,10,1), #3
"learning_rate": np.arange(0.05,0.5,0.05), #0.1
"subsample":np.arange(0.1,1.5,0.1),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model, param_distributions=param_grid, n_iter=5, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(x_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7000000000000001, 'n_estimators': 20, 'max_depth': 9, 'learning_rate': 0.35000000000000003} with CV score=0.7950261006599034: CPU times: user 159 ms, sys: 31.7 ms, total: 191 ms Wall time: 204 ms
#Build Base Model
XGB_original= XGBClassifier(random_state=1)
# Fit the model on training data
XGB_original.fit(x_train, y_train)
# Creating new model with best parameters
XGB_tuned_original = XGBClassifier(
subsample= 0.7,
n_estimators= 20,
max_depth= 9,
learning_rate= 0.35
)
# Fit the model on training data
XGB_tuned_original.fit(x_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.35, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=9, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=20, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...)
#display performance of base model on validation set
print('Original')
pref_XGB_original = model_performance_classification_sklearn(
model=XGB_original,predictors=x_val,target=y_val)
print(pref_XGB_original)
#display performance of tuned model on validation set
print('Tuned')
pref_XGB_original_tuned=model_performance_classification_sklearn(
model=XGB_tuned_original,predictors=x_val,target=y_val)
print(pref_XGB_original_tuned)
#display performance of tuned model on training set
print('Tuned Performance Training')
pref_XGB_original_tuned_train=model_performance_classification_sklearn(
model=XGB_tuned_original,predictors=x_train,target=y_train)
print(pref_XGB_original_tuned_train)
Original Accuracy Recall Precision F1 0 0.843575 0.797101 0.797101 0.797101 Tuned Accuracy Recall Precision F1 0 0.854749 0.782609 0.830769 0.80597 Tuned Performance Training Accuracy Recall Precision F1 0 0.928371 0.868132 0.940476 0.902857
Over Sampling¶
XGB¶
%%time
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": np.arange(10,150,10), #100
"max_depth":np.arange(1,10,1), #3
"learning_rate": np.arange(0.05,0.5,0.01), #0.1
"subsample":np.arange(0.1,1.5,0.1),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=5, n_jobs = -1,
scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(x_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7000000000000001, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.17000000000000004} with CV score=0.8383311688311688: CPU times: user 186 ms, sys: 45.3 ms, total: 232 ms Wall time: 214 ms
#Build base model
XGB_over= XGBClassifier(random_state=1)
# Fit the model on training data
XGB_over.fit(x_train_over, y_train_over)
# Creating new model with best parameters
XGB_tuned_over = XGBClassifier(
subsample= 0.7,
n_estimators= 20,
max_depth= 9,
learning_rate= 0.35
)
# Fit the model on training data
XGB_tuned_over.fit(x_train_over, y_train_over)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.35, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=9, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=20, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...)
#display performance of base model on validation set
print('Original')
pref_XGB_over = model_performance_classification_sklearn(
model=XGB_over,predictors=x_val,target=y_val)
print(pref_XGB_over)
#display performance of tuned model on validation set
print('Tuned')
pref_XGB_over_tuned=model_performance_classification_sklearn(
model=XGB_tuned_over,predictors=x_val,target=y_val)
print(pref_XGB_over_tuned)
#display performance of tuned model on training set
print('\nTuned Training Performance')
pref_XGB_over_tuned_train=model_performance_classification_sklearn(
model=XGB_tuned_over,predictors=x_train_over,target=y_train_over)
print(pref_XGB_over_tuned_train)
# Display the confusion matrix for the tuned XGB (oversampled) model
confusion_matrix_sklearn(model=XGB_tuned_over,predictors=x_val,target=y_val)
Original Accuracy Recall Precision F1 0 0.843575 0.797101 0.797101 0.797101 Tuned Accuracy Recall Precision F1 0 0.860335 0.826087 0.814286 0.820144 Tuned Training Performance Accuracy Recall Precision F1 0 0.952164 0.943052 0.960557 0.951724
Under Sampling¶
XGB¶
%%time
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"subsample":np.arange(0.1,1,0.1),
"n_estimators": np.arange(10,150,10),
"max_depth":np.arange(1,10,1),
"learning_rate": np.arange(0.05,0.5,0.05),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=5, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(x_train_under,y_train_under)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.2, 'n_estimators': 130, 'max_depth': 2, 'learning_rate': 0.05} with CV score=0.792977481234362: CPU times: user 282 ms, sys: 52.2 ms, total: 334 ms Wall time: 198 ms
#Build base model
XGB_under= XGBClassifier(random_state=1)
# Fit the model on training data
XGB_under.fit(x_train_under, y_train_under)
# Creating new model with best parameters
XGB_tuned_under = XGBClassifier(
subsample= 0.7,
n_estimators= 20,
max_depth= 9,
learning_rate= 0.35
)
# Fit the model on training data
XGB_tuned_under.fit(x_train_under, y_train_under)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.35, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=9, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=20, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...)
#display performance of base model on validation set
print('Original')
pref_XGB_under = model_performance_classification_sklearn(
model=XGB_under,predictors=x_val,target=y_val)
print(pref_XGB_under)
#display performance of tuned model on validation set
print('Tuned')
pref_XGB_under_tuned=model_performance_classification_sklearn(
model=XGB_tuned_under,predictors=x_val,target=y_val)
print(pref_XGB_under_tuned)
#display performance of tuned model on training set
print('\nTuned Training Performance')
pref_XGB_under_tuned_train=model_performance_classification_sklearn(
model=XGB_tuned_under,predictors=x_train_under,target=y_train_under)
print(pref_XGB_under_tuned_train)
# Display the confusion matrix for the tuned XGB (undersampled) model
confusion_matrix_sklearn(model=XGB_tuned_under,predictors=x_val,target=y_val)
Original Accuracy Recall Precision F1 0 0.765363 0.826087 0.655172 0.730769 Tuned Accuracy Recall Precision F1 0 0.815642 0.826087 0.730769 0.77551 Tuned Training Performance Accuracy Recall Precision F1 0 0.934066 0.923077 0.94382 0.933333
# test performance comparison
models_comparison = pd.concat(
[
pref_rForest_original.T,
pref_rForest_original_tuned.T,
pref_GBM_original.T,
pref_GBM_original_tuned.T,
pref_XGB_original.T,
pref_XGB_original_tuned.T,
pref_XGB_over.T,
pref_XGB_over_tuned.T,
pref_XGB_under.T,
pref_XGB_under_tuned.T,
],
axis=1,
)
models_comparison.columns = [
"rForest",
"rForest tuned",
"GBM",
"GBM tuned",
"XGB",
"XGB tuned",
"XGB OV",
"XGB OV tuned",
"XGB UN",
"XGB UN tuned",
]
print("Validation Performance Comparison across the best tuned and untuned models:")
models_comparison
Validation Performance Comparison across the best tuned and untuned models:
rForest | rForest tuned | GBM | GBM tuned | XGB | XGB tuned | XGB OV | XGB OV tuned | XGB UN | XGB UN tuned | |
---|---|---|---|---|---|---|---|---|---|---|
Accuracy | 0.821229 | 0.843575 | 0.849162 | 0.821229 | 0.843575 | 0.854749 | 0.843575 | 0.860335 | 0.765363 | 0.815642 |
Recall | 0.739130 | 0.768116 | 0.768116 | 0.710145 | 0.797101 | 0.782609 | 0.797101 | 0.826087 | 0.826087 | 0.826087 |
Precision | 0.784615 | 0.815385 | 0.828125 | 0.803279 | 0.797101 | 0.830769 | 0.797101 | 0.814286 | 0.655172 | 0.730769 |
F1 | 0.761194 | 0.791045 | 0.796992 | 0.753846 | 0.797101 | 0.805970 | 0.797101 | 0.820144 | 0.730769 | 0.775510 |
Insights¶
- The XGB oversampled model has the best accuracy, 0.86.
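The winner can also be picked out of the comparison frame programmatically; a small sketch:
# model with the highest validation accuracy
best = models_comparison.loc["Accuracy"].idxmax()
print(best, "->", round(models_comparison.loc["Accuracy", best], 3))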
# Separating features and the target column
X = df_modeling.drop("Survived", axis=1)
Y = df_modeling["Survived"]
# create dummies
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True
)
# display head
X.head()
Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|
0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
2 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
3 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
4 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 1 |
x_train2=X.copy()
y_train2=Y.copy()
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
x_train_over2, y_train_over2 = sm.fit_resample(x_train2, y_train2)
model_final=XGBClassifier(
subsample= 0.7,
n_estimators= 20,
max_depth= 9,
learning_rate= 0.35,random_state=1
)
model_final.fit(x_train_over2,y_train_over2)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.35, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=9, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=20, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=1, ...)
# # Get the probability of survival from the model using predict_proba
# # set threshold
# threshold=0.4
# # get probability of survival
# prob_of_survival=pd.DataFrame(model_final.predict_proba(x_val))[1]
# # Get data frame of passengers likely to survive
# likely_to_survive=pd.DataFrame(prob_of_survival[prob_of_survival>=threshold])
# # Display a message
# print("If we use a threshold of",threshold,"we can say",len(likely_to_survive),"are likely to survive.")
Using the final model to make predictions¶
df_test=data_test.copy()
df_test.drop(['PassengerId','Name','Ticket','Cabin'],axis=1,inplace=True)
# create dummies
df_test = pd.get_dummies(
df_test,
columns=df_test.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True
)
df_test.head()
Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|
0 | 3 | 34.5 | 0 | 0 | 7.8292 | 1 | 1 | 0 |
1 | 3 | 47.0 | 1 | 0 | 7.0000 | 0 | 0 | 1 |
2 | 2 | 62.0 | 0 | 0 | 9.6875 | 1 | 1 | 0 |
3 | 3 | 27.0 | 0 | 0 | 8.6625 | 1 | 0 | 1 |
4 | 3 | 22.0 | 1 | 1 | 12.2875 | 0 | 0 | 1 |
#Using Final Model
# get predictions
pred=model_final.predict(df_test)
# get the predicted probability of survival
prob_of_survival=pd.DataFrame(model_final.predict_proba(df_test))[1]
#Make a copy of the test data
df_test2=df_test.copy()
#add a Survived column
df_test2['Survived']=pred
#Add a Probability of survival column
df_test2['prob_of_survival']=prob_of_survival*100
#Display the data head with the new columns
df_test2.head()
Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S | Survived | prob_of_survival | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 34.5 | 0 | 0 | 7.8292 | 1 | 1 | 0 | 0 | 3.857241 |
1 | 3 | 47.0 | 1 | 0 | 7.0000 | 0 | 0 | 1 | 0 | 10.933467 |
2 | 2 | 62.0 | 0 | 0 | 9.6875 | 1 | 1 | 0 | 0 | 8.820826 |
3 | 3 | 27.0 | 0 | 0 | 8.6625 | 1 | 0 | 1 | 0 | 37.190594 |
4 | 3 | 22.0 | 1 | 1 | 12.2875 | 0 | 0 | 1 | 1 | 51.639305 |
#get total number of 1's
print("Total number of surviving passangers in the test data:",len(df_test2[df_test2['Survived']==1]))
Total number of surviving passangers in the test data: 213
Insights¶
- Any passenger with a predicted probability of survival above 50% received a 1.
- Of the 418 passengers in the test set, 213 were predicted to survive.
- Since we know from the original training data that only about 40% of the passengers survived, we can adjust the model's threshold to get results closer to what we might expect (see the sketch below).
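Rather than hand-tuning, the threshold can be read off the predicted probabilities directly: take the quantile that leaves roughly the training-set survival rate above it. A minimal sketch, using the prob_of_survival series computed above (0–1 scale):

# Pick the threshold so the top ~39% of predicted probabilities
# are classified as survivors, matching the training base rate.
train_survival_rate = 0.39
suggested = np.quantile(prob_of_survival, 1 - train_survival_rate)
print("Suggested threshold:", round(suggested, 2))

With these probabilities, the suggested value should land near the 0.74 used in the next cell.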
#Adjusting the model threshold
#Setting the threshold
threshold=0.74
#Get data frame of passengers likely to survive.
likely_to_survive=pd.DataFrame(df_test2[df_test2['prob_of_survival']>=threshold*100])
#Calculate and display the survivors and the percent of the total.
print("Out of", len(prob_of_survival), "passangers.", len(likely_to_survive), "are likely to survive.")
print(round(len(likely_to_survive)/len(prob_of_survival)*100,2),"% likely to survive.")
Out of 418 passangers. 166 are likely to survive. 39.71 % likely to survive.
#Display list of passengers likely to survive from the test data set.
likely_to_survive.head(10)
Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S | Survived | prob_of_survival | |
---|---|---|---|---|---|---|---|---|---|---|
8 | 3 | 18.0 | 0 | 0 | 7.2292 | 0 | 0 | 0 | 1 | 88.224327 |
10 | 3 | NaN | 0 | 0 | 7.8958 | 1 | 0 | 1 | 1 | 74.705513 |
12 | 1 | 23.0 | 1 | 0 | 82.2667 | 0 | 0 | 1 | 1 | 98.513237 |
14 | 1 | 47.0 | 1 | 0 | 61.1750 | 0 | 0 | 1 | 1 | 97.130569 |
15 | 2 | 24.0 | 1 | 0 | 27.7208 | 0 | 0 | 0 | 1 | 95.927986 |
22 | 1 | NaN | 0 | 0 | 31.6833 | 0 | 0 | 1 | 1 | 96.884499 |
23 | 1 | 21.0 | 0 | 1 | 61.3792 | 1 | 0 | 0 | 1 | 83.880630 |
24 | 1 | 48.0 | 1 | 3 | 262.3750 | 0 | 0 | 0 | 1 | 89.966728 |
26 | 1 | 22.0 | 0 | 1 | 61.9792 | 0 | 0 | 0 | 1 | 99.371407 |
29 | 3 | NaN | 2 | 0 | 21.6792 | 1 | 0 | 0 | 1 | 94.423431 |
Insights¶
- With a threshold of 0.74 (74%), 166 of the 418 test passengers would be predicted to survive.
- That is a 39.71% survival rate, much closer to the training set's survival rate of about 39%.
#Splitting the original data to get fresh x_train2, y_train2, x_val2, and y_val2
#to pass into the pipeline for training.
#copy data
df_train2=df_modeling.copy()
# Separating features and the target column
X = df_train2.drop("Survived", axis=1)
Y = df_train2["Survived"]
# Splitting the data into train and test sets in 75:25 ratio
x_train2, x_val2, y_train2, y_val2 = train_test_split(
X, Y, test_size=0.25, random_state=1, shuffle=True, stratify=Y
)
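Because stratify=Y is passed, the survival rate is preserved in both splits; a quick sanity check using the variables just created:

# Confirm the stratified split kept the class balance in both subsets
print("Train survival rate:     ", round(y_train2.mean(), 3))
print("Validation survival rate:", round(y_val2.mean(), 3))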
from sklearn.base import BaseEstimator, TransformerMixin

class columnDropperTransformer(BaseEstimator, TransformerMixin):
    """Drop the given columns from a DataFrame inside a pipeline."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data
        return self
    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)

class ReplaceWithTransformer(BaseEstimator, TransformerMixin):
    """Replace values in the given features using parallel mapping dicts.
    features[i] is replaced using replace_struc[i]; with revert=True the
    mappings are inverted so a previous replacement can be undone."""
    def __init__(self, features=[], replace_struc=[], revert=False):
        self.features = features
        self.replace_struc = replace_struc
        self.revert = revert
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data
        return self
    def transform(self, X, y=None):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for feature, replacement in zip(self.features, self.replace_struc):
            if self.revert:
                replacement = {v: k for k, v in replacement.items()}
            X[feature] = X[feature].replace(replacement)
        return X
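Neither transformer is used in the pipeline below, but as a hypothetical illustration (column names follow the Titanic schema; the Embarked mapping is invented for the example) they can be applied like any other sklearn transformer:

# Hypothetical usage of the two custom transformers defined above.
dropper = columnDropperTransformer(["Name", "Ticket", "Cabin"])
replacer = ReplaceWithTransformer(
    features=["Embarked"],
    replace_struc=[{"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}],
)
demo = replacer.fit_transform(dropper.fit_transform(data_train.copy()))
demo.head()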
from sklearn.compose import make_column_selector, make_column_transformer
# defining pipe using make_pipeline
pipe_XGB = make_pipeline(
SimpleImputer(strategy="most_frequent"),
OneHotEncoder(handle_unknown="ignore"),
#SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1),
    XGBClassifier(
        subsample=0.7,
        n_estimators=20,
        max_depth=9,
        learning_rate=0.35,
    ),
)
# fit pipe object to data
pipe_XGB.fit(x_train2,y_train2)
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore')), ('xgbclassifier', XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_...=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.35, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=9, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=20, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...))])
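Note that OneHotEncoder in this pipeline is applied to every column, numeric ones included, because no column selection happens first. The make_column_selector and make_column_transformer imports above point at an alternative worth sketching: encode only the object/category columns and impute the numeric ones separately. This is an illustrative variant, not the fitted model used below (the median strategy for numeric imputation is an assumption):

# Sketch: per-dtype preprocessing instead of encoding every column.
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     make_column_selector(dtype_include=["object", "category"])),
    (SimpleImputer(strategy="median"),
     make_column_selector(dtype_include="number")),
)
pipe_XGB_ct = make_pipeline(
    preprocess,
    XGBClassifier(subsample=0.7, n_estimators=20, max_depth=9, learning_rate=0.35),
)
pipe_XGB_ct.fit(x_train2, y_train2)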
Using the pipeline to make predictions¶
df_test=data_test.copy()
df_test.drop(['PassengerId','Name','Ticket','Cabin'],axis=1,inplace=True)
df_test.head()
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | |
---|---|---|---|---|---|---|---|
0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q |
1 | 3 | female | 47.0 | 1 | 0 | 7.0000 | S |
2 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q |
3 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S |
4 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S |
#Using Pipeline
#get predictions
pred=pipe_XGB.predict(df_test)
#get predicted probability of survival
prob_of_survival=pd.DataFrame(pipe_XGB.predict_proba(df_test))[1]
#Make a copy of the test data
df_test2=df_test.copy()
#add a Survived column
df_test2['Survived']=pred
#Add a Probability of survival column
df_test2['prob_of_survival']=prob_of_survival*100
#Display the data head with the new columns
df_test2.head(20)
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Survived | prob_of_survival | |
---|---|---|---|---|---|---|---|---|---|
0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q | 0 | 3.603213 |
1 | 3 | female | 47.0 | 1 | 0 | 7.0000 | S | 0 | 33.709225 |
2 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q | 0 | 2.217934 |
3 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S | 0 | 29.124826 |
4 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S | 0 | 41.915558 |
5 | 3 | male | 14.0 | 0 | 0 | 9.2250 | S | 0 | 4.398355 |
6 | 3 | female | 30.0 | 0 | 0 | 7.6292 | Q | 1 | 57.990604 |
7 | 2 | male | 26.0 | 1 | 1 | 29.0000 | S | 0 | 46.846775 |
8 | 3 | female | 18.0 | 0 | 0 | 7.2292 | C | 1 | 68.863167 |
9 | 3 | male | 21.0 | 2 | 0 | 24.1500 | S | 0 | 2.060762 |
10 | 3 | male | NaN | 0 | 0 | 7.8958 | S | 0 | 5.000453 |
11 | 1 | male | 46.0 | 0 | 0 | 26.0000 | S | 0 | 19.951044 |
12 | 1 | female | 23.0 | 1 | 0 | 82.2667 | S | 1 | 97.533676 |
13 | 2 | male | 63.0 | 1 | 0 | 26.0000 | S | 0 | 2.513480 |
14 | 1 | female | 47.0 | 1 | 0 | 61.1750 | S | 1 | 96.439850 |
15 | 2 | female | 24.0 | 1 | 0 | 27.7208 | C | 1 | 91.883820 |
16 | 2 | male | 35.0 | 0 | 0 | 12.3500 | Q | 0 | 3.171004 |
17 | 3 | male | 21.0 | 0 | 0 | 7.2250 | C | 0 | 17.245821 |
18 | 3 | female | 27.0 | 1 | 0 | 7.9250 | S | 1 | 73.813728 |
19 | 3 | female | 45.0 | 0 | 0 | 7.2250 | C | 1 | 68.709671 |
#get total number of 1's
print("Total number of surviving passangers in the test data:",len(df_test2[df_test2['Survived']==1]))
Total number of surviving passangers in the test data: 159
Insights¶
- Any passenger with a predicted probability of survival above 50% received a 1.
- Of the 418 passengers in the test set, 159 were predicted to survive.
- Since about 40% of the passengers in the original training data survived, we can check how close the predicted survival rate is at this default threshold.
#Adjusting the model threshold
#Setting the threshold
threshold=0.5
#Get data frame of passengers likely to survive.
likely_to_survive=pd.DataFrame(df_test2[df_test2['prob_of_survival']>=threshold*100])
#Calculate and display the survivors and the percent of the total.
print("Out of", len(prob_of_survival), "passangers.", len(likely_to_survive), "are likely to survive.")
print(round(len(likely_to_survive)/len(prob_of_survival)*100,2),"% likely to survive.")
Out of 418 passangers. 159 are likely to survive. 38.04 % likely to survive.
#Display list of passengers likely to survive from the test data set.
likely_to_survive.head(10)
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Survived | prob_of_survival | |
---|---|---|---|---|---|---|---|---|---|
6 | 3 | female | 30.0 | 0 | 0 | 7.6292 | Q | 1 | 57.990604 |
8 | 3 | female | 18.0 | 0 | 0 | 7.2292 | C | 1 | 68.863167 |
12 | 1 | female | 23.0 | 1 | 0 | 82.2667 | S | 1 | 97.533676 |
14 | 1 | female | 47.0 | 1 | 0 | 61.1750 | S | 1 | 96.439850 |
15 | 2 | female | 24.0 | 1 | 0 | 27.7208 | C | 1 | 91.883820 |
18 | 3 | female | 27.0 | 1 | 0 | 7.9250 | S | 1 | 73.813728 |
19 | 3 | female | 45.0 | 0 | 0 | 7.2250 | C | 1 | 68.709671 |
20 | 1 | male | 55.0 | 1 | 0 | 59.4000 | C | 1 | 54.749870 |
22 | 1 | female | NaN | 0 | 0 | 31.6833 | S | 1 | 96.679016 |
24 | 1 | female | 48.0 | 1 | 3 | 262.3750 | C | 1 | 96.181473 |
Insights¶
- At the default threshold of 0.5, 159 of the 418 test passengers would be predicted to survive.
- That is a 38.04% survival rate, already close to the training set's survival rate of about 39%.
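As a final step for the Kaggle challenge, the predictions can be written out in the two-column format the competition expects. A minimal sketch, assuming data_test still holds the untouched test set with its PassengerId column and pred holds the pipeline's 0/1 predictions:

# Build and save the Kaggle submission file (PassengerId, Survived).
submission = pd.DataFrame({
    "PassengerId": data_test["PassengerId"],
    "Survived": pred,
})
submission.to_csv("submission.csv", index=False)
print(submission.shape)  # expected: (418, 2)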