Table of Contents¶
Context:¶
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
# Necessary Imports
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
import statistics as stats
import scipy.stats as spstats
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True) # adds a nice background to the graphs
%matplotlib inline
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Libraries to build ensemble classifiers
from sklearn.ensemble import (
BaggingClassifier,
RandomForestClassifier,
GradientBoostingClassifier,
AdaBoostClassifier,
StackingClassifier,
)
# To tune different models
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn import metrics
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
#plot_confusion_matrix,
make_scorer,
precision_recall_curve,
roc_curve,
)
# To impute missing values
from sklearn.impute import SimpleImputer,KNNImputer
# To undersample and oversample the data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline, make_pipeline
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from xgboost import XGBClassifier
# import data
data_train = pd.read_csv("train.csv")
df=data_train.copy()
# display data head
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# import data
data_test = pd.read_csv("test.csv")
# display data head
data_test.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
#get the shape of the data
df.shape
(891, 12)
# Use info() to print a summary of the DataFrame
df.info()
RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
# Check for columns that have missing values
df.isnull().sum().sort_values(ascending=False)
Cabin 687 Age 177 Embarked 2 PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 SibSp 0 Parch 0 Ticket 0 Fare 0 dtype: int64
Insights¶
- There are missing values in Age, Cabin, and Embarked. We will need to impute these missing values in some way.
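As a quick check before choosing an imputation strategy, the missing counts above can be read as percentages; a small sketch using the df loaded earlier:
# percentage of missing values per column
missing_pct = df.isnull().mean().mul(100).round(2).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])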
# Converting the object columns to "category"
cat_columns = df.select_dtypes(include=["object", "category"]).columns.tolist()
for col in cat_columns:
df[col] = df[col].astype("category")
# Use info() to print a summary of the DataFrame
df.info()
RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null category 4 Sex 891 non-null category 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null category 9 Fare 891 non-null float64 10 Cabin 204 non-null category 11 Embarked 889 non-null category dtypes: category(5), float64(2), int64(5) memory usage: 122.0 KB
# Drop the PassengerId column as it is not needed
df.drop("PassengerId", axis=1, inplace=True)
# Statistical summary of the data
df.describe(include="all").T
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
Survived | 891.0 | NaN | NaN | NaN | 0.383838 | 0.486592 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
Pclass | 891.0 | NaN | NaN | NaN | 2.308642 | 0.836071 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
Name | 891 | 891 | Abbing, Mr. Anthony | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Sex | 891 | 2 | male | 577 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Age | 714.0 | NaN | NaN | NaN | 29.699118 | 14.526497 | 0.42 | 20.125 | 28.0 | 38.0 | 80.0 |
SibSp | 891.0 | NaN | NaN | NaN | 0.523008 | 1.102743 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
Parch | 891.0 | NaN | NaN | NaN | 0.381594 | 0.806057 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
Ticket | 891 | 681 | 1601 | 7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Fare | 891.0 | NaN | NaN | NaN | 32.204208 | 49.693429 | 0.0 | 7.9104 | 14.4542 | 31.0 | 512.3292 |
Cabin | 204 | 147 | C23 C25 C27 | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Embarked | 889 | 3 | S | 644 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# declares the DrawHist function
def DrawHist(data, feature, save_image=False):
"""Takes in the dataframe and feature you want to get a histogram of."""
# creates histplot for the feature
plt.figure(figsize=(15, 3))
sns.histplot(data=data, x=feature, kde=True)
# plots a green line for the mean
plt.axvline(x=data[feature].mean(), c="green", label="mean")
# plots a red line for the median
plt.axvline(x=data[feature].median(), c="red", label="median")
# Adds a legend
plt.legend(ncol=1, loc="upper right", frameon=True)
# Adds a title to the histogram
plt.title("Histogram: " + feature, fontdict={"fontsize": 20}, pad=10)
# save to image
if save_image:
plt.savefig(
"Histogram_" + feature + ".jpg", bbox_inches="tight", pad_inches=0.5
)
# declares the DrawBox function
def DrawBox(data, feature, save_image=False):
"""Takes in the dataframe and feature you want to get a boxplot of."""
# creates boxplot
plt.figure(figsize=(15, 1))
sns.boxplot(data=data, x=feature)
# adds a title to the plot
plt.title("Box Plot: " + feature, fontdict={"fontsize": 20}, pad=10)
# save to image
if save_image:
plt.savefig("BoxPlot_" + feature + ".jpg", bbox_inches="tight", pad_inches=0.5)
# declares the DrawCountPlot function
def DrawCountPlot(data, feature, num=10, save_image=False):
"""Takes in a dataframe, feature, and number of items to display on the x axis."""
# creates a count plot
plt.figure(figsize=(15, 3)) # To resize the plot
sns.countplot(
data=data, x=feature, order=data[feature].value_counts().iloc[:num].index
)
# rotates the ticks on the x axis
plt.xticks(rotation=90)
# adds a title to the plot
plt.title("Count Plot: " + feature, fontdict={"fontsize": 20}, pad=10)
# save to image
if save_image:
plt.savefig(
"CountPlot_" + feature + ".jpg", bbox_inches="tight", pad_inches=0.5
)
plt.show()
# declares the DrawStackedBarPlot function
def DrawStackedBarPlot(data, feature_x, feature_y, max_elements=10, save_image=False):
num_elements = len(data[feature_x].unique())
if (num_elements <= max_elements) and (feature_x!=feature_y):
# stacked bar plot
tab = pd.crosstab(data[feature_x], data[feature_y], normalize="index").sort_values(
by=data[feature_y].value_counts().index[-1], ascending=True
)
tab.plot(kind="barh", stacked=True, figsize=(20, num_elements / 2))
# adds legend to plot
plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
# adds title to plot
plt.title(feature_x + " vs " + feature_y, fontdict={"fontsize": 25}, pad=10)
# save to image
if save_image:
plt.savefig(
"Stacked_Bar_Plot" + "_" + feature_y + "_" + feature_x + ".jpg",
bbox_inches="tight",
pad_inches=0.5,
)
# defines a function to perform a univariate analysis using histplot,
# and boxplot for numerical features, and a countplot for categorical features
def Univariate_Analysis(
data, features, num=15, save_image=False, display_categorical=True
):
# creates a copy of the data
cData = data.copy()
for col in cData[features]:
if cData[col].dtypes == "float64" or cData[col].dtypes == "int64":
# creates histplot of feature
DrawHist(data=cData, feature=col, save_image=save_image)
# creates boxplot of feature
DrawBox(data=cData, feature=col, save_image=save_image)
else:
if display_categorical:
# creates countplot of feature
DrawCountPlot(data=cData, feature=col, num=num, save_image=save_image)
# Perform Univariate Analysis
Univariate_Analysis(data=df, features=df.columns, num=25)
print(round(len(df[df['Survived']==0])/len(df)*100,2),"% of passengers died in the training data.")
61.62 % of passengers died in the training data.
Insights¶
- More passengers died than survived.
- There were more passengers in 3rd class than in 1st and 2nd combined.
- About twice as many passengers were male.
- The mean age was about 30 years; Age has a slight right tail.
- SibSp is quite skewed, with a long right tail. Some outliers appear to be present.
- Parch also has a long right tail.
- The mean and median fare are quite low, with a few very high outliers.
- The vast majority embarked from S.
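The tail and skew observations above can be quantified with pandas' skew(); a minimal sketch on the numerical columns mentioned (positive values indicate a right tail):
# skewness of the numerical features discussed above
print(df[["Age", "SibSp", "Parch", "Fare"]].skew())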
# heat map of numerical features.
fig, ax = plt.subplots(figsize=(20, 5))
sns.heatmap(
df.corr(), ax=ax, annot=True, linewidths=0.05, fmt=".2f", cmap="magma"
) # the color intensity reflects the correlation strength
# plt.savefig('HeatMap.jpg',bbox_inches ="tight",pad_inches = 0.5)
plt.show()
Insights¶
- The strongest correlation with Survived is with Pclass (-0.34).
- Survived has a positive correlation with Fare (0.26).
- The strongest positive correlation is between Parch and SibSp (0.41).
- There is a strong negative correlation between Pclass and Fare (-0.55).
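To read the heatmap numerically, the correlations with Survived can be pulled out and sorted by strength; a small sketch (depending on the pandas version, numeric_only=True may be required by corr()):
# correlation of each numeric feature with Survived, strongest first
corr_target = df.corr(numeric_only=True)["Survived"].drop("Survived")
print(corr_target.reindex(corr_target.abs().sort_values(ascending=False).index))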
# Define a function to perform a Bivariate analysis with a chosen feature across all other categorical features.
# Then creates a pairplots with the chosen feature across all numerical features.
def Bivariate_Analysis(data, feature, max_elements=10, save_image=False):
plt.figure(figsize=(20, 5))
# Box plot of categorical variables against the chosen feature
# push all categorical features into a dataframe
categorical_cols = data.select_dtypes(include=["category", "object"]).columns
for col in categorical_cols:
# Draw a stacked bar plot
DrawStackedBarPlot(
data=data,
feature_x=col,
feature_y=feature,
max_elements=max_elements,
save_image=save_image,
)
    # Pair plot of numerical variables against the chosen feature
    # push all numerical features into a list
    numerical_cols = data.select_dtypes(include=["int", "float"]).columns
    # plot the numerical columns in groups of 5 so that none are skipped
    for n in range(0, len(numerical_cols), 5):
        sns.pairplot(
            data=data, x_vars=numerical_cols[n : n + 5], y_vars=feature, kind="reg"
        )
        # plt.title('Pair Plot: ' + feature + ' vs Numerical Columns', fontdict={'fontsize': 20}, pad=10)
# save to image for the pairplot
if save_image:
plt.savefig(
"PairPlot" + "_" + feature + ".jpg", bbox_inches="tight", pad_inches=0.5
)
# performs a bivariate analysis
Bivariate_Analysis(data=df, feature="Survived", max_elements=35)
Insights¶
- Far more women survived than men.
- More passengers survived if they embarked from C.
- More 1st class passengers survived than 2nd or 3rd class.
- As age increased, so did the rate of death.
- As Parch increases, so does the rate of survival.
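The gender and class effects above can be verified with simple group means; a minimal sketch on the training frame:
# survival rate by sex and by passenger class
print(df.groupby("Sex")["Survived"].mean())
print(df.groupby("Pclass")["Survived"].mean())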
Key Takeaways¶
- The passengers with the greatest likelihood of survival are 1st class young women.
- I feel the most important drivers of survival are Pclass and gender.
- Remove the Name feature.
- Convert the Cabin feature to a Deck feature (e.g., Deck A through Deck G).
- Impute missing values.
- Bin Ticket somehow; may just remove it for now.
- Create dummies for Sex, Embarked, and Deck.
#make copy of dataframe
df_processed=df.copy()
#Drop features
df_processed.drop(['Name','Ticket'],axis=1,inplace=True)
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | NaN | S |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | NaN | S |
# Convert Cabin to Deck (first character of the cabin code; NaN stays NaN)
df_processed['Deck'] = df_processed['Cabin'].str[:1]
#Remove Cabin
df_processed.drop(['Cabin'],axis=1,inplace=True)
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | NaN |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | NaN |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | NaN |
df_processed['Deck'].value_counts()
C 59 B 47 D 33 E 32 A 15 F 13 G 4 T 1 Name: Deck, dtype: int64
#define function to replace values in a feature
def replace_with(data, features=[], replace_struc=[], revert=False):
    # do the replacing for each listed feature
    for feature in features:
        replacement = replace_struc[features.index(feature)]
        if revert:  # invert the mapping to restore the original labels
            replacement = {v: k for k, v in replacement.items()}
        data[feature].replace(replacement, inplace=True)
    return data
# setup a dictionary to do the replacing
Sex = {"male": 0, "female": 1}
Embarked = {"C": 1,"Q": 2,"S": 3}
Deck = {"T": 0,"A": 1,"B": 2,"C": 3,"D": 4,"E": 5,"F": 6,"G": 7}
#use replace_with on the features that are not numerical
df_processed = replace_with(df_processed,['Sex','Embarked','Deck'],[Sex,Embarked,Deck])
# display the head
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 | 3 | NaN |
1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 3.0 |
2 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 | 3 | NaN |
3 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 3 | 3.0 |
4 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 | 3 | NaN |
# Impute missing values with a KNN imputer
imputer = KNNImputer(n_neighbors=5)
col_to_impute = ['Embarked', 'Sex', 'Age', 'Deck']
df_processed[col_to_impute] = imputer.fit_transform(df_processed[col_to_impute])
# round the imputed values back to whole-number codes
df_processed[col_to_impute] = round(df_processed[col_to_impute])
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0.0 | 22.0 | 1 | 0 | 7.2500 | 3.0 | 2.0 |
1 | 1 | 1 | 1.0 | 38.0 | 1 | 0 | 71.2833 | 1.0 | 3.0 |
2 | 1 | 3 | 1.0 | 26.0 | 0 | 0 | 7.9250 | 3.0 | 4.0 |
3 | 1 | 1 | 1.0 | 35.0 | 1 | 0 | 53.1000 | 3.0 | 3.0 |
4 | 0 | 3 | 0.0 | 35.0 | 0 | 0 | 8.0500 | 3.0 | 3.0 |
#reverse the replace_with
df_processed = replace_with(df_processed,['Sex','Embarked','Deck'],[Sex,Embarked,Deck],revert=True)
# display the head
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | B |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | D |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | C |
# performs a bivariate analysis
Bivariate_Analysis(data=df_processed, feature="Survived", max_elements=35)
Insights¶
- Female passengers are still far more likely to survive than male passengers.
- Passengers that embarked from C were more likely to survive.
- Passengers on deck E were more likely to survive. If we discount deck T (Tank Top), passengers from deck B saw the most deaths.
- Comparing decks A, B, C with decks D, E, F, one could say a passenger was more likely to survive if they stayed on the lower decks as opposed to the upper decks.
- Pclass has a strong negative relationship with survival.
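The deck observations above can be checked with a group mean before binning; a small sketch on df_processed:
# survival rate by deck, highest first
print(df_processed.groupby("Deck")["Survived"].mean().sort_values(ascending=False))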
#Bin decks into upper and lower decks
#Upper decks will be defined as decks A,B,C
#Lower decks will be defined as decks D,E,F,G
# setup a dictionary to do the replacing
DeckLevel = {"T": "upper","A": "upper","B": "upper","C": "upper","D": "lower","E": "lower","F": "lower","G": "lower"}
df_processed['DeckLevel'] = df_processed['Deck']
#use replace_with on the features that are not numerical
df_processed = replace_with(df_processed,['DeckLevel'],[DeckLevel])
# display the head
df_processed.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | DeckLevel | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | B | upper |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C | upper |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | D | lower |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C | upper |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | C | upper |
# performs a bivariate analysis
Bivariate_Analysis(data=df_processed, feature="Survived", max_elements=35)
Insights¶
- As expected, this shows that the lower decks had a greater likelihood of survival.
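The same group-mean check works for the binned feature; a minimal sketch:
# survival rate for the two deck levels
print(df_processed.groupby("DeckLevel")["Survived"].mean())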
# defines a function to get the outlier powers
def Get_Outlier_Powers(data, feature, limits=[0.25, 0.75], print_quartiles=True):
# gets the quartiles of the feature based on the limits set
quartiles = np.quantile(data[feature][data[feature].notnull()], limits)
# calculates the power_4iqr
power_4iqr = 4 * (quartiles[1] - quartiles[0])
# print what Q1, Q3, and 4*IQR are if desired
if print_quartiles:
print(
"Q1 =", quartiles[0], "Q3 =", quartiles[1], "4*IQR =", round(power_4iqr, 2)
)
# calculates the outlier powers
outlier_powers = data.loc[
np.abs(data[feature] - data[feature].median()) > power_4iqr, feature
]
return outlier_powers
# Define a function to detect outliers in numerical columns and display a boxplot
def Detect_Outliers(data, features, limits=[0.25, 0.75], save_image=False):
cData = data.copy()
for col in cData[features]:
if cData[col].dtypes == "float64" or cData[col].dtypes == "int64":
# Find Quartiles
quartiles = np.quantile(cData[col][cData[col].notnull()], limits)
# calculate power_4iqr
power_4iqr = 4 * (quartiles[1] - quartiles[0])
# Draw BoxPlot
plt.figure(figsize=(15, 1))
sns.boxplot(data=cData, x=col)
plt.title(
col
+ " | "
+ " Q1 = "
+ str(quartiles[0])
+ " | Q3 = "
+ str(quartiles[1])
+ " | 4*IQR = "
+ str(round(power_4iqr, 2)),
fontdict={"fontsize": 20},
pad=10,
)
            # save to image
            if save_image:
                plt.savefig(
                    "BoxPlot_" + col + ".jpg", bbox_inches="tight", pad_inches=0.5
                )
# define a function that will remove the outliers of the chosen features
# based on the set limits and their outlier powers
def Drop_Outliers(data, features, limits=[0.25, 0.75]):
# make of copy data
cData = data.copy()
for col in cData[features]:
# get outlier_powers
outlier_powers = Get_Outlier_Powers(
cData, col, limits=limits, print_quartiles=False
)
# drop the outliers from the data
cData.drop(outlier_powers.index, axis=0, inplace=True)
return cData
# Run Detect_Outliers
Detect_Outliers(data=df_processed, features=df_processed.columns, limits=[0.25, 0.75])
# # Run Drop_Outliers on df_processed_no_cat and store in df_processed_outliers_dropped
# df_processed_outliers_dropped = Drop_Outliers(
# data=df_processed,
# features=["Fare", "Parch", "SibSp"],
# limits=[0.1, 0.9],
# )
# # run Detect_Outliers again to get a view of the numerical features after dropping outliers
# Detect_Outliers(
# data=df_processed_outliers_dropped, features=df_processed_outliers_dropped.columns
# )
# # push a copy of df_processed_outliers_dropped into df_cleaned
# df_cleaned = df_processed_outliers_dropped.copy()
# df_processed=df_cleaned
# Drop features
df_processed2 = df_processed.copy()
df_processed2.drop(['Deck','DeckLevel'],axis=1,inplace=True)
df_modeling=df_processed2.copy()
# Separating features and the target column
X = df_modeling.drop("Survived", axis=1)
Y = df_modeling["Survived"]
# create dummies
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True
)
# display head
X.head()
Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|
0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
2 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
3 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
4 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 1 |
# Splitting the data into train and validation sets in an 80:20 ratio
x_train, x_val, y_train, y_val = train_test_split(
X, Y, test_size=0.2, random_state=1, shuffle=True, stratify=Y
)
# get shape of training and test set
x_train.shape, x_val.shape
((712, 8), (179, 8))
# Print percentage of training set
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
# Print percentage of validation set
print("Percentage of classes in validation set:")
print(y_val.value_counts(normalize=True))
# setting the class weight
class_weighting = {
0: round(y_train.value_counts(normalize=True)[1], 2),
1: round(y_train.value_counts(normalize=True)[0], 2),
}
print("*" * 50)
print("Class Weighting:", class_weighting)
Percentage of classes in training set: 0 0.616573 1 0.383427 Name: Survived, dtype: float64 Percentage of classes in validation set: 0 0.614525 1 0.385475 Name: Survived, dtype: float64 ************************************************** Class Weighting: {0: 0.38, 1: 0.62}
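The hand-rolled weighting above gives each class the other class's prevalence. For comparison, scikit-learn can compute inverse-frequency weights directly; a minimal sketch (the compute_class_weight import is an addition, and its scale differs from the dictionary above even though the intent is the same):
from sklearn.utils.class_weight import compute_class_weight
# "balanced" weights = n_samples / (n_classes * class_counts)
balanced = compute_class_weight(class_weight="balanced", classes=np.unique(y_train), y=y_train)
print(dict(zip(np.unique(y_train), balanced.round(2))))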
Original Models¶
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
#define function to build all models and display results of the chosen scorer
def BuildModels(models, x_train, y_train, score='recall'):
    '''Builds all models and displays k-fold cross-validation and validation-set results for the chosen scorer.'''
    #catch if no model defined then display message and return
    if len(models) == 0:
        print('please pass in at least 1 model.')
return
# necessary imports if not already done
import sklearn.metrics as metrics
from sklearn.metrics import (
recall_score,
accuracy_score,
precision_score,
f1_score,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score
#define scorer
if(score=='recall'): scorer = metrics.make_scorer(metrics.recall_score)
if(score=='accuracy'): scorer = metrics.make_scorer(metrics.accuracy_score)
if(score=='precision'): scorer = metrics.make_scorer(metrics.precision_score)
if(score=='f1'): scorer = metrics.make_scorer(metrics.f1_score)
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
fitted_models=[] # Empty list to store all fitted models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Cost:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=x_train, y=y_train, scoring=scorer, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
# loop through all models to get the validation performance (note: scores against the global x_val/y_val split)
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(x_train, y_train)
scores = recall_score(y_val, model.predict(x_val))
print("{}: {}".format(name, scores))
fitted_models.append(model)
return fitted_models
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.accuracy_score)
#define all model we will be building
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("XGB", XGBClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
#build models using original data
models_original=BuildModels(models,x_train=x_train,y_train=y_train,score='accuracy')
Cross-Validation Cost: Logistic Regression: 0.8047572146163695 Decision Tree: 0.7612035851472471 Random Forest: 0.796395154141633 Bagging: 0.7935684034275583 GBM: 0.825824879346006 XGB: 0.7892839554811386 Adaboost: 0.8047670639219936 Validation Performance: Logistic Regression: 0.6956521739130435 Decision Tree: 0.6956521739130435 Random Forest: 0.7536231884057971 Bagging: 0.7101449275362319 GBM: 0.7681159420289855 XGB: 0.7971014492753623 Adaboost: 0.6666666666666666
Insights¶
- The best model under Cross-Validation Cost is GBM.
- The best model under Validation Performance is XGB.
#Display the validation performance across all models using the original data.
for model in models_original:
name=type(model).__name__
mod_pref=model_performance_classification_sklearn(model=model,predictors=x_val,target=y_val)
print(name,"\n",mod_pref,"\n")
LogisticRegression Accuracy Recall Precision F1 0 0.782123 0.695652 0.727273 0.711111 DecisionTreeClassifier Accuracy Recall Precision F1 0 0.793296 0.695652 0.75 0.721805 RandomForestClassifier Accuracy Recall Precision F1 0 0.832402 0.753623 0.8 0.776119 BaggingClassifier Accuracy Recall Precision F1 0 0.798883 0.710145 0.753846 0.731343 GradientBoostingClassifier Accuracy Recall Precision F1 0 0.849162 0.768116 0.828125 0.796992 XGBClassifier Accuracy Recall Precision F1 0 0.843575 0.797101 0.797101 0.797101 AdaBoostClassifier Accuracy Recall Precision F1 0 0.776536 0.666667 0.730159 0.69697
Insights¶
- Here GBM is better than XGB on accuracy, but only very slightly.
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
x_train_over, y_train_over = sm.fit_resample(x_train, y_train)
print("Before OverSampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After OverSampling, count of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, count of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(x_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, count of label '1': 273 Before OverSampling, count of label '0': 439 After OverSampling, count of label '1': 439 After OverSampling, count of label '0': 439 After OverSampling, the shape of train_X: (878, 8) After OverSampling, the shape of train_y: (878,)
#build models using oversample data
models_over=BuildModels(models,x_train=x_train_over,y_train=y_train_over,score='accuracy')
Cross-Validation Cost: Logistic Regression: 0.8177857142857142 Decision Tree: 0.8017922077922076 Random Forest: 0.8303246753246754 Bagging: 0.8246233766233766 GBM: 0.8314480519480518 XGB: 0.8360194805194805 Adaboost: 0.8257922077922079 Validation Performance: Logistic Regression: 0.7101449275362319 Decision Tree: 0.7681159420289855 Random Forest: 0.7391304347826086 Bagging: 0.7681159420289855 GBM: 0.7681159420289855 XGB: 0.7971014492753623 Adaboost: 0.7246376811594203
Insights¶
- When oversampled data is used, the XGB model is the best.
#Display the validation performance across all models using the oversample data.
for model in models_over:
name=type(model).__name__
mod_pref=model_performance_classification_sklearn(model=model,predictors=x_val,target=y_val)
print(name,"\n",mod_pref,"\n")
LogisticRegression Accuracy Recall Precision F1 0 0.776536 0.710145 0.710145 0.710145 DecisionTreeClassifier Accuracy Recall Precision F1 0 0.804469 0.768116 0.736111 0.751773 RandomForestClassifier Accuracy Recall Precision F1 0 0.804469 0.73913 0.75 0.744526 BaggingClassifier Accuracy Recall Precision F1 0 0.826816 0.768116 0.779412 0.773723 GradientBoostingClassifier Accuracy Recall Precision F1 0 0.826816 0.768116 0.779412 0.773723 XGBClassifier Accuracy Recall Precision F1 0 0.843575 0.797101 0.797101 0.797101 AdaBoostClassifier Accuracy Recall Precision F1 0 0.776536 0.724638 0.704225 0.714286
#Display the confusion matrix for the random forest model
rForest_over=RandomForestClassifier(random_state=1)
rForest_over.fit(x_train_over,y_train_over)
confusion_matrix_sklearn(model=rForest_over,predictors=x_val,target=y_val)
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
x_train_under, y_train_under = rus.fit_resample(x_train, y_train)
print("Before Under Sampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before Under Sampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, count of label '1': {}".format(sum(y_train_under == 1)))
print("After Under Sampling, count of label '0': {} \n".format(sum(y_train_under == 0)))
print("After Under Sampling, the shape of train_X: {}".format(x_train_under.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_under.shape))
Before Under Sampling, count of label '1': 273 Before Under Sampling, count of label '0': 439 After Under Sampling, count of label '1': 273 After Under Sampling, count of label '0': 273 After Under Sampling, the shape of train_X: (546, 8) After Under Sampling, the shape of train_y: (546,)
#build models using undersample data
models_under=BuildModels(models,x_train=x_train_under,y_train=y_train_under,score='accuracy')
Cross-Validation Cost: Logistic Regression: 0.8021351125938281 Decision Tree: 0.7452376980817348 Random Forest: 0.7801501251042535 Bagging: 0.7709924937447873 GBM: 0.7783319432860718 XGB: 0.7618181818181817 Adaboost: 0.7746622185154296 Validation Performance: Logistic Regression: 0.7391304347826086 Decision Tree: 0.7536231884057971 Random Forest: 0.782608695652174 Bagging: 0.782608695652174 GBM: 0.8115942028985508 XGB: 0.8260869565217391 Adaboost: 0.7391304347826086
Insights¶
- When undersampled data is used, XGB and GBM are about equal.
#Display the validation performance across all models using the undersample data.
for model in models_under:
name=type(model).__name__
mod_pref=model_performance_classification_sklearn(model=model,predictors=x_val,target=y_val)
print(name,"\n",mod_pref,"\n")
LogisticRegression Accuracy Recall Precision F1 0 0.776536 0.73913 0.69863 0.71831 DecisionTreeClassifier Accuracy Recall Precision F1 0 0.77095 0.753623 0.684211 0.717241 RandomForestClassifier Accuracy Recall Precision F1 0 0.798883 0.782609 0.72 0.75 BaggingClassifier Accuracy Recall Precision F1 0 0.826816 0.782609 0.771429 0.776978 GradientBoostingClassifier Accuracy Recall Precision F1 0 0.826816 0.811594 0.756757 0.783217 XGBClassifier Accuracy Recall Precision F1 0 0.765363 0.826087 0.655172 0.730769 AdaBoostClassifier Accuracy Recall Precision F1 0 0.77095 0.73913 0.689189 0.713287
#Display the confusion matrix for the random forest model
rForest_under=RandomForestClassifier(random_state=1)
rForest_under.fit(x_train_under,y_train_under)
confusion_matrix_sklearn(model=rForest_under,predictors=x_val,target=y_val)
RandomForestClassifier¶
%%time
# defining model
Model = RandomForestClassifier(random_state=1,class_weight=class_weighting)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": np.arange(10, 100,10),
"min_samples_leaf": np.arange(1, 8,1),
"max_samples": np.arange(0.2, 1, 0.1),
"max_features": np.arange(0.1, 1, 0.1),
"max_depth": np.arange(10, 30, 1)
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model, param_distributions=param_grid, n_iter=5, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(x_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 90, 'min_samples_leaf': 2, 'max_samples': 0.6000000000000001, 'max_features': 0.2, 'max_depth': 26} with CV score=0.8020191076529104: CPU times: user 159 ms, sys: 169 ms, total: 327 ms Wall time: 2.59 s
#Build Base Model
rForest_original= RandomForestClassifier(random_state=1,class_weight=class_weighting)
# Fit the model on training data
rForest_original.fit(x_train, y_train)
# Creating new model with best parameters
rForest_tuned_original = RandomForestClassifier(
n_estimators=90,
min_samples_leaf= 2,
max_samples= 0.6,
max_features= 0.2,
max_depth= 26
)
# Fit the model on training data
rForest_tuned_original.fit(x_train, y_train)
RandomForestClassifier(max_depth=26, max_features=0.2, max_samples=0.6, min_samples_leaf=2, n_estimators=90)
#display performance of base model on validation set
print('Original')
pref_rForest_original = model_performance_classification_sklearn(
model=rForest_original,predictors=x_val,target=y_val)
print(pref_rForest_original)
#display performance of tuned model on validation set
print('Tuned')
pref_rForest_original_tuned=model_performance_classification_sklearn(
model=rForest_tuned_original,predictors=x_val,target=y_val)
print(pref_rForest_original_tuned)
#display performance of tuned model on training set
print('Tuned Performance Training')
pref_rForest_original_tuned_train=model_performance_classification_sklearn(
model=rForest_tuned_original,predictors=x_train,target=y_train)
print(pref_rForest_original_tuned_train)
Original Accuracy Recall Precision F1 0 0.821229 0.73913 0.784615 0.761194 Tuned Accuracy Recall Precision F1 0 0.843575 0.768116 0.815385 0.791045 Tuned Performance Training Accuracy Recall Precision F1 0 0.907303 0.827839 0.922449 0.872587
GBM¶
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
param_grid = {
"n_estimators": np.arange(50,150,10), #100
"max_features":np.arange(0.5,1.5,0.5),
"max_depth":np.arange(1,10,1), #3
"max_leaf_nodes":np.arange(1,10,1),
"learning_rate": np.arange(0.05,0.5,0.05), #0.1
"subsample":np.arange(0.1,1.5,0.1),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model, param_distributions=param_grid, n_iter=5, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(x_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.8, 'n_estimators': 100, 'max_leaf_nodes': 7, 'max_features': 1.0, 'max_depth': 2, 'learning_rate': 0.15000000000000002} with CV score=0.8174825174825175: CPU times: user 113 ms, sys: 10.4 ms, total: 124 ms Wall time: 402 ms
#Build Base Model
GBM_original= GradientBoostingClassifier(random_state=1)
# Fit the model on training data
GBM_original.fit(x_train, y_train)
# Creating new model with best parameters
GBM_tuned_original = GradientBoostingClassifier(
subsample= 0.7,
n_estimators= 75,
max_leaf_nodes= 5,
max_features= 0.5,
max_depth= 3,
learning_rate= 0.1
)
# Fit the model on training data
GBM_tuned_original.fit(x_train, y_train)
GradientBoostingClassifier(max_features=0.5, max_leaf_nodes=5, n_estimators=75, subsample=0.7)
#display performance of base model on validation set
print('Original')
pref_GBM_original = model_performance_classification_sklearn(
model=GBM_original,predictors=x_val,target=y_val)
print(pref_GBM_original)
#display performance of tuned model on validation set
print('Tuned')
pref_GBM_original_tuned=model_performance_classification_sklearn(
model=GBM_tuned_original,predictors=x_val,target=y_val)
print(pref_GBM_original_tuned)
#display performance of tuned model on training set
print('Tuned Performance Training')
pref_GBM_original_tuned_train=model_performance_classification_sklearn(
model=GBM_tuned_original,predictors=x_train,target=y_train)
print(pref_GBM_original_tuned_train)
Original Accuracy Recall Precision F1 0 0.849162 0.768116 0.828125 0.796992 Tuned Accuracy Recall Precision F1 0 0.821229 0.710145 0.803279 0.753846 Tuned Performance Training Accuracy Recall Precision F1 0 0.869382 0.78022 0.865854 0.820809
XGB¶
%%time
# defining model
Model = XGBClassifier(random_state=1)
param_grid = {
"n_estimators": np.arange(10,150,10), #100
"max_depth":np.arange(1,10,1), #3
"learning_rate": np.arange(0.05,0.5,0.05), #0.1
"subsample":np.arange(0.1,1.5,0.1),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model, param_distributions=param_grid, n_iter=5, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(x_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7000000000000001, 'n_estimators': 20, 'max_depth': 9, 'learning_rate': 0.35000000000000003} with CV score=0.7950261006599034: CPU times: user 159 ms, sys: 31.7 ms, total: 191 ms Wall time: 204 ms
#Build Base Model
XGB_original= XGBClassifier(random_state=1)
# Fit the model on training data
XGB_original.fit(x_train, y_train)
# Creating new model with best parameters
XGB_tuned_original = XGBClassifier(
subsample= 0.7,
n_estimators= 20,
max_depth= 9,
learning_rate= 0.35
)
# Fit the model on training data
XGB_tuned_original.fit(x_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.35, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=9, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=20, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...)
#display performance of base model on validation set
print('Original')
pref_XGB_original = model_performance_classification_sklearn(
model=XGB_original,predictors=x_val,target=y_val)
print(pref_XGB_original)
#display performance of tuned model on validation set
print('Tuned')
pref_XGB_original_tuned=model_performance_classification_sklearn(
model=XGB_tuned_original,predictors=x_val,target=y_val)
print(pref_XGB_original_tuned)
#display performance of tuned model on training set
print('Tuned Performance Training')
pref_XGB_original_tuned_train=model_performance_classification_sklearn(
model=XGB_tuned_original,predictors=x_train,target=y_train)
print(pref_XGB_original_tuned_train)
Original Accuracy Recall Precision F1 0 0.843575 0.797101 0.797101 0.797101 Tuned Accuracy Recall Precision F1 0 0.854749 0.782609 0.830769 0.80597 Tuned Performance Training Accuracy Recall Precision F1 0 0.928371 0.868132 0.940476 0.902857
Over Sampling¶
XGB¶
%%time
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": np.arange(10,150,10), #100
"max_depth":np.arange(1,10,1), #3
"learning_rate": np.arange(0.05,0.5,0.01), #0.1
"subsample":np.arange(0.1,1.5,0.1),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=5, n_jobs = -1,
scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(x_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7000000000000001, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.17000000000000004} with CV score=0.8383311688311688: CPU times: user 186 ms, sys: 45.3 ms, total: 232 ms Wall time: 214 ms
#Build base model
XGB_over= XGBClassifier(random_state=1)
# Fit the model on training data
XGB_over.fit(x_train_over, y_train_over)
# Creating new model with best parameters
XGB_tuned_over = XGBClassifier(
subsample= 0.7,
n_estimators= 20,
max_depth= 9,
learning_rate= 0.35
)
# Fit the model on training data
XGB_tuned_over.fit(x_train_over, y_train_over)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.35, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=9, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=20, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...)
#display performance of base model on validation set
print('Original')
pref_XGB_over = model_performance_classification_sklearn(
model=XGB_over,predictors=x_val,target=y_val)
print(pref_XGB_over)
#display performance of tuned model on validation set
print('Tuned')
pref_XGB_over_tuned=model_performance_classification_sklearn(
model=XGB_tuned_over,predictors=x_val,target=y_val)
print(pref_XGB_over_tuned)
#display performance of tuned model on training set
print('\nTuned Training Performance')
pref_XGB_over_tuned_train=model_performance_classification_sklearn(
model=XGB_tuned_over,predictors=x_train_over,target=y_train_over)
print(pref_XGB_over_tuned_train)
# Display the confusion matrix for the tuned XGB (oversampled) model
confusion_matrix_sklearn(model=XGB_tuned_over,predictors=x_val,target=y_val)
Original Accuracy Recall Precision F1 0 0.843575 0.797101 0.797101 0.797101 Tuned Accuracy Recall Precision F1 0 0.860335 0.826087 0.814286 0.820144 Tuned Training Performance Accuracy Recall Precision F1 0 0.952164 0.943052 0.960557 0.951724
Under Sampling¶
XGB¶
%%time
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"subsample":np.arange(0.1,1,0.1),
"n_estimators": np.arange(10,150,10),
"max_depth":np.arange(1,10,1),
"learning_rate": np.arange(0.05,0.5,0.05),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=5, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(x_train_under,y_train_under)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.2, 'n_estimators': 130, 'max_depth': 2, 'learning_rate': 0.05} with CV score=0.792977481234362: CPU times: user 282 ms, sys: 52.2 ms, total: 334 ms Wall time: 198 ms
#Build base model
XGB_under= XGBClassifier(random_state=1)
# Fit the model on training data
XGB_under.fit(x_train_under, y_train_under)
# Creating new model with best parameters
XGB_tuned_under = XGBClassifier(
subsample= 0.7,
n_estimators= 20,
max_depth= 9,
learning_rate= 0.35
)
# Fit the model on training data
XGB_tuned_under.fit(x_train_under, y_train_under)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.35, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=9, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=20, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...)
#display performance of base model on validation set
print('Original')
pref_XGB_under = model_performance_classification_sklearn(
model=XGB_under,predictors=x_val,target=y_val)
print(pref_XGB_under)
#display performance of tuned model on validation set
print('Tuned')
pref_XGB_under_tuned=model_performance_classification_sklearn(
model=XGB_tuned_under,predictors=x_val,target=y_val)
print(pref_XGB_under_tuned)
#display performance of tuned model on training set
print('\nTuned Training Performance')
pref_XGB_under_tuned_train=model_performance_classification_sklearn(
model=XGB_tuned_under,predictors=x_train_under,target=y_train_under)
print(pref_XGB_under_tuned_train)
# Display the confusion matrix for the tuned XGB (undersampled) model
confusion_matrix_sklearn(model=XGB_tuned_under,predictors=x_val,target=y_val)
Original Accuracy Recall Precision F1 0 0.765363 0.826087 0.655172 0.730769 Tuned Accuracy Recall Precision F1 0 0.815642 0.826087 0.730769 0.77551 Tuned Training Performance Accuracy Recall Precision F1 0 0.934066 0.923077 0.94382 0.933333
# test performance comparison
models_comparison = pd.concat(
[
pref_rForest_original.T,
pref_rForest_original_tuned.T,
pref_GBM_original.T,
pref_GBM_original_tuned.T,
pref_XGB_original.T,
pref_XGB_original_tuned.T,
pref_XGB_over.T,
pref_XGB_over_tuned.T,
pref_XGB_under.T,
pref_XGB_under_tuned.T,
],
axis=1,
)
models_comparison.columns = [
"rForest",
"rForest tuned",
"GBM",
"GBM tuned",
"XGB",
"XGB tuned",
"XGB OV",
"XGB OV tuned",
"XGB UN",
"XGB UN tuned",
]
print("Validation Performance Comparison across the best tuned and untuned models:")
models_comparison
Validation Performance Comparison across the best tuned and untuned models:
rForest | rForest tuned | GBM | GBM tuned | XGB | XGB tuned | XGB OV | XGB OV tuned | XGB UN | XGB UN tuned | |
---|---|---|---|---|---|---|---|---|---|---|
Accuracy | 0.821229 | 0.843575 | 0.849162 | 0.821229 | 0.843575 | 0.854749 | 0.843575 | 0.860335 | 0.765363 | 0.815642 |
Recall | 0.739130 | 0.768116 | 0.768116 | 0.710145 | 0.797101 | 0.782609 | 0.797101 | 0.826087 | 0.826087 | 0.826087 |
Precision | 0.784615 | 0.815385 | 0.828125 | 0.803279 | 0.797101 | 0.830769 | 0.797101 | 0.814286 | 0.655172 | 0.730769 |
F1 | 0.761194 | 0.791045 | 0.796992 | 0.753846 | 0.797101 | 0.805970 | 0.797101 | 0.820144 | 0.730769 | 0.775510 |
Insights¶
- The XGB oversampled model has the best accuracy, 0.86.
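The winner can also be picked out of the comparison frame programmatically; a small sketch:
# model with the highest validation accuracy
best = models_comparison.loc["Accuracy"].idxmax()
print(best, "->", round(models_comparison.loc["Accuracy", best], 3))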
# Separating features and the target column
X = df_modeling.drop("Survived", axis=1)
Y = df_modeling["Survived"]
# create dummies
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True
)
# display head
X.head()
Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|
0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
2 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
3 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
4 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 1 |
x_train2=X.copy()
y_train2=Y.copy()
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
x_train_over2, y_train_over2 = sm.fit_resample(x_train2, y_train2)
model_final=XGBClassifier(
subsample= 0.7,
n_estimators= 20,
max_depth= 9,
learning_rate= 0.35,random_state=1
)
model_final.fit(x_train_over2,y_train_over2)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.35, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=9, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=20, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=1, ...)
# # Get the probability of survival from the model using predict_proba
# # set threshold
# threshold=0.4
# # get probability of survival
# prob_of_survival=pd.DataFrame(model_final.predict_proba(x_val))[1]
# # Get data frame of passengers likely to survive
# likely_to_survive=pd.DataFrame(prob_of_survival[prob_of_survival>=threshold])
# # Display a message
# print("If we use a threshold of",threshold,"we can say",len(likely_to_survive),"are likely to survive.")
Using the final model to make predictions¶
df_test=data_test.copy()
df_test.drop(['PassengerId','Name','Ticket','Cabin'],axis=1,inplace=True)
# create dummies
df_test = pd.get_dummies(
df_test,
columns=df_test.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True
)
df_test.head()
Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|
0 | 3 | 34.5 | 0 | 0 | 7.8292 | 1 | 1 | 0 |
1 | 3 | 47.0 | 1 | 0 | 7.0000 | 0 | 0 | 1 |
2 | 2 | 62.0 | 0 | 0 | 9.6875 | 1 | 1 | 0 |
3 | 3 | 27.0 | 0 | 0 | 8.6625 | 1 | 0 | 1 |
4 | 3 | 22.0 | 1 | 1 | 12.2875 | 0 | 0 | 1 |
#Using Final Model
# get predictions
pred=model_final.predict(df_test)
# get the predicted probability of survival
prob_of_survival=pd.DataFrame(model_final.predict_proba(df_test))[1]
#Make a copy of the test data
df_test2=df_test.copy()
#add a Survived column
df_test2['Survived']=pred
#Add a Probability of survival column
df_test2['prob_of_survival']=prob_of_survival*100
#Display the data head with the new columns
df_test2.head()
Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S | Survived | prob_of_survival | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 34.5 | 0 | 0 | 7.8292 | 1 | 1 | 0 | 0 | 3.857241 |
1 | 3 | 47.0 | 1 | 0 | 7.0000 | 0 | 0 | 1 | 0 | 10.933467 |
2 | 2 | 62.0 | 0 | 0 | 9.6875 | 1 | 1 | 0 | 0 | 8.820826 |
3 | 3 | 27.0 | 0 | 0 | 8.6625 | 1 | 0 | 1 | 0 | 37.190594 |
4 | 3 | 22.0 | 1 | 1 | 12.2875 | 0 | 0 | 1 | 1 | 51.639305 |
#get total number of 1's
print("Total number of surviving passangers in the test data:",len(df_test2[df_test2['Survived']==1]))
Total number of surviving passangers in the test data: 213
Insights¶
- Any passenger with a predicted probability of survival above 50% received a 1.
- Of the 418 passengers in the test set, 213 were predicted to survive.
- Since we know from the original training data that only about 40% of the passengers survived, we can adjust the model's threshold to get results closer to what we might expect (see the sketch below).
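Rather than hand-tuning, the threshold can be read off the predicted probabilities directly: take the quantile that leaves roughly the training-set survival rate above it. A minimal sketch, using the prob_of_survival series computed above (0–1 scale):

# Pick the threshold so the top ~39% of predicted probabilities
# are classified as survivors, matching the training base rate.
train_survival_rate = 0.39
suggested = np.quantile(prob_of_survival, 1 - train_survival_rate)
print("Suggested threshold:", round(suggested, 2))

With these probabilities, the suggested value should land near the 0.74 used in the next cell.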
#Adjusting the model threshold
#Setting the threshold
threshold=0.74
#Get data frame of passengers likely to survive.
likely_to_survive=pd.DataFrame(df_test2[df_test2['prob_of_survival']>=threshold*100])
#Calculate and display the survivors and the percent of the total.
print("Out of", len(prob_of_survival), "passangers.", len(likely_to_survive), "are likely to survive.")
print(round(len(likely_to_survive)/len(prob_of_survival)*100,2),"% likely to survive.")
Out of 418 passangers. 166 are likely to survive. 39.71 % likely to survive.
#Display list of passengers likely to survive from the test data set.
likely_to_survive.head(10)
Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S | Survived | prob_of_survival | |
---|---|---|---|---|---|---|---|---|---|---|
8 | 3 | 18.0 | 0 | 0 | 7.2292 | 0 | 0 | 0 | 1 | 88.224327 |
10 | 3 | NaN | 0 | 0 | 7.8958 | 1 | 0 | 1 | 1 | 74.705513 |
12 | 1 | 23.0 | 1 | 0 | 82.2667 | 0 | 0 | 1 | 1 | 98.513237 |
14 | 1 | 47.0 | 1 | 0 | 61.1750 | 0 | 0 | 1 | 1 | 97.130569 |
15 | 2 | 24.0 | 1 | 0 | 27.7208 | 0 | 0 | 0 | 1 | 95.927986 |
22 | 1 | NaN | 0 | 0 | 31.6833 | 0 | 0 | 1 | 1 | 96.884499 |
23 | 1 | 21.0 | 0 | 1 | 61.3792 | 1 | 0 | 0 | 1 | 83.880630 |
24 | 1 | 48.0 | 1 | 3 | 262.3750 | 0 | 0 | 0 | 1 | 89.966728 |
26 | 1 | 22.0 | 0 | 1 | 61.9792 | 0 | 0 | 0 | 1 | 99.371407 |
29 | 3 | NaN | 2 | 0 | 21.6792 | 1 | 0 | 0 | 1 | 94.423431 |
Insights¶
- With a threshold of 0.74 (74%), 166 of the 418 test passengers would be predicted to survive.
- That is a 39.71% survival rate, much closer to the training set's survival rate of about 39%.
#Splitting the original data to get fresh x_train2, y_train2, x_val2, and y_val2
#to pass into the pipeline for training.
#copy data
df_train2=df_modeling.copy()
# Separating features and the target column
X = df_train2.drop("Survived", axis=1)
Y = df_train2["Survived"]
# Splitting the data into train and test sets in 75:25 ratio
x_train2, x_val2, y_train2, y_val2 = train_test_split(
X, Y, test_size=0.25, random_state=1, shuffle=True, stratify=Y
)
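Because stratify=Y is passed, the survival rate is preserved in both splits; a quick sanity check using the variables just created:

# Confirm the stratified split kept the class balance in both subsets
print("Train survival rate:     ", round(y_train2.mean(), 3))
print("Validation survival rate:", round(y_val2.mean(), 3))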
from sklearn.base import BaseEstimator, TransformerMixin

class columnDropperTransformer(BaseEstimator, TransformerMixin):
    """Drop the given columns from a DataFrame inside a pipeline."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data
        return self
    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)

class ReplaceWithTransformer(BaseEstimator, TransformerMixin):
    """Replace values in the given features using parallel mapping dicts.
    features[i] is replaced using replace_struc[i]; with revert=True the
    mappings are inverted so a previous replacement can be undone."""
    def __init__(self, features=[], replace_struc=[], revert=False):
        self.features = features
        self.replace_struc = replace_struc
        self.revert = revert
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data
        return self
    def transform(self, X, y=None):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for feature, replacement in zip(self.features, self.replace_struc):
            if self.revert:
                replacement = {v: k for k, v in replacement.items()}
            X[feature] = X[feature].replace(replacement)
        return X
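Neither transformer is used in the pipeline below, but as a hypothetical illustration (column names follow the Titanic schema; the Embarked mapping is invented for the example) they can be applied like any other sklearn transformer:

# Hypothetical usage of the two custom transformers defined above.
dropper = columnDropperTransformer(["Name", "Ticket", "Cabin"])
replacer = ReplaceWithTransformer(
    features=["Embarked"],
    replace_struc=[{"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}],
)
demo = replacer.fit_transform(dropper.fit_transform(data_train.copy()))
demo.head()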
from sklearn.compose import make_column_selector, make_column_transformer
# defining pipe using make_pipeline
pipe_XGB = make_pipeline(
SimpleImputer(strategy="most_frequent"),
OneHotEncoder(handle_unknown="ignore"),
#SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1),
    XGBClassifier(
        subsample=0.7,
        n_estimators=20,
        max_depth=9,
        learning_rate=0.35,
    ),
)
# fit pipe object to data
pipe_XGB.fit(x_train2,y_train2)
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore')), ('xgbclassifier', XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_...=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.35, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=9, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=20, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...))])
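Note that OneHotEncoder in this pipeline is applied to every column, numeric ones included, because no column selection happens first. The make_column_selector and make_column_transformer imports above point at an alternative worth sketching: encode only the object/category columns and impute the numeric ones separately. This is an illustrative variant, not the fitted model used below (the median strategy for numeric imputation is an assumption):

# Sketch: per-dtype preprocessing instead of encoding every column.
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     make_column_selector(dtype_include=["object", "category"])),
    (SimpleImputer(strategy="median"),
     make_column_selector(dtype_include="number")),
)
pipe_XGB_ct = make_pipeline(
    preprocess,
    XGBClassifier(subsample=0.7, n_estimators=20, max_depth=9, learning_rate=0.35),
)
pipe_XGB_ct.fit(x_train2, y_train2)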
Using the pipeline to make predictions¶
df_test=data_test.copy()
df_test.drop(['PassengerId','Name','Ticket','Cabin'],axis=1,inplace=True)
df_test.head()
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | |
---|---|---|---|---|---|---|---|
0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q |
1 | 3 | female | 47.0 | 1 | 0 | 7.0000 | S |
2 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q |
3 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S |
4 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S |
#Using Pipeline
#get predictions
pred=pipe_XGB.predict(df_test)
#get predicted probability of survival
prob_of_survival=pd.DataFrame(pipe_XGB.predict_proba(df_test))[1]
#Make a copy of the test data
df_test2=df_test.copy()
#add a Survived column
df_test2['Survived']=pred
#Add a Probability of survival column
df_test2['prob_of_survival']=prob_of_survival*100
#Display the data head with the new columns
df_test2.head(20)
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Survived | prob_of_survival | |
---|---|---|---|---|---|---|---|---|---|
0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q | 0 | 3.603213 |
1 | 3 | female | 47.0 | 1 | 0 | 7.0000 | S | 0 | 33.709225 |
2 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q | 0 | 2.217934 |
3 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S | 0 | 29.124826 |
4 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S | 0 | 41.915558 |
5 | 3 | male | 14.0 | 0 | 0 | 9.2250 | S | 0 | 4.398355 |
6 | 3 | female | 30.0 | 0 | 0 | 7.6292 | Q | 1 | 57.990604 |
7 | 2 | male | 26.0 | 1 | 1 | 29.0000 | S | 0 | 46.846775 |
8 | 3 | female | 18.0 | 0 | 0 | 7.2292 | C | 1 | 68.863167 |
9 | 3 | male | 21.0 | 2 | 0 | 24.1500 | S | 0 | 2.060762 |
10 | 3 | male | NaN | 0 | 0 | 7.8958 | S | 0 | 5.000453 |
11 | 1 | male | 46.0 | 0 | 0 | 26.0000 | S | 0 | 19.951044 |
12 | 1 | female | 23.0 | 1 | 0 | 82.2667 | S | 1 | 97.533676 |
13 | 2 | male | 63.0 | 1 | 0 | 26.0000 | S | 0 | 2.513480 |
14 | 1 | female | 47.0 | 1 | 0 | 61.1750 | S | 1 | 96.439850 |
15 | 2 | female | 24.0 | 1 | 0 | 27.7208 | C | 1 | 91.883820 |
16 | 2 | male | 35.0 | 0 | 0 | 12.3500 | Q | 0 | 3.171004 |
17 | 3 | male | 21.0 | 0 | 0 | 7.2250 | C | 0 | 17.245821 |
18 | 3 | female | 27.0 | 1 | 0 | 7.9250 | S | 1 | 73.813728 |
19 | 3 | female | 45.0 | 0 | 0 | 7.2250 | C | 1 | 68.709671 |
#get total number of 1's
print("Total number of surviving passangers in the test data:",len(df_test2[df_test2['Survived']==1]))
Total number of surviving passangers in the test data: 159
Insights¶
- Any passenger with a predicted probability of survival above 50% received a 1.
- Of the 418 passengers in the test set, 159 were predicted to survive.
- Since about 40% of the passengers in the original training data survived, we can check how close the predicted survival rate is at this default threshold.
#Adjusting the model threshold
#Setting the threshold
threshold=0.5
#Get data frame of passengers likely to survive.
likely_to_survive=pd.DataFrame(df_test2[df_test2['prob_of_survival']>=threshold*100])
#Calculate and display the survivors and the percent of the total.
print("Out of", len(prob_of_survival), "passangers.", len(likely_to_survive), "are likely to survive.")
print(round(len(likely_to_survive)/len(prob_of_survival)*100,2),"% likely to survive.")
Out of 418 passangers. 159 are likely to survive. 38.04 % likely to survive.
#Display list of passengers likely to survive from the test data set.
likely_to_survive.head(10)
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Survived | prob_of_survival | |
---|---|---|---|---|---|---|---|---|---|
6 | 3 | female | 30.0 | 0 | 0 | 7.6292 | Q | 1 | 57.990604 |
8 | 3 | female | 18.0 | 0 | 0 | 7.2292 | C | 1 | 68.863167 |
12 | 1 | female | 23.0 | 1 | 0 | 82.2667 | S | 1 | 97.533676 |
14 | 1 | female | 47.0 | 1 | 0 | 61.1750 | S | 1 | 96.439850 |
15 | 2 | female | 24.0 | 1 | 0 | 27.7208 | C | 1 | 91.883820 |
18 | 3 | female | 27.0 | 1 | 0 | 7.9250 | S | 1 | 73.813728 |
19 | 3 | female | 45.0 | 0 | 0 | 7.2250 | C | 1 | 68.709671 |
20 | 1 | male | 55.0 | 1 | 0 | 59.4000 | C | 1 | 54.749870 |
22 | 1 | female | NaN | 0 | 0 | 31.6833 | S | 1 | 96.679016 |
24 | 1 | female | 48.0 | 1 | 3 | 262.3750 | C | 1 | 96.181473 |
Insights¶
- At the default threshold of 0.5, 159 of the 418 test passengers would be predicted to survive.
- That is a 38.04% survival rate, already close to the training set's survival rate of about 39%.
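As a final step for the Kaggle challenge, the predictions can be written out in the two-column format the competition expects. A minimal sketch, assuming data_test still holds the untouched test set with its PassengerId column and pred holds the pipeline's 0/1 predictions:

# Build and save the Kaggle submission file (PassengerId, Survived).
submission = pd.DataFrame({
    "PassengerId": data_test["PassengerId"],
    "Survived": pred,
})
submission.to_csv("submission.csv", index=False)
print(submission.shape)  # expected: (418, 2)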