Titanic Prediction Model
Sailing through Data: Who Survives the Titanic?
Link to project repository on GitHub
Goal: build a machine learning model that predicts whether a passenger survived the sinking of the Titanic.
For each passenger in the test set, the model must predict a 0 or 1 value for the Survived variable.
Submissions are evaluated on accuracy: the percentage of passengers whose survival status is predicted correctly.
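To make the metric concrete, here is a minimal sketch using scikit-learn's `accuracy_score`; the two arrays are made-up labels, not competition data:

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground truth and predictions for five passengers (1 = survived).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Accuracy is the fraction predicted correctly: 4 of 5 = 0.8 here.
print(accuracy_score(y_true, y_pred))  # 0.8
```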
Steps
- EDA
- Feature Engineering (see the sketch after this list)
  - Family Size: Larger families might have different survival rates than solo travelers.
  - Person's Title (e.g., Ms, Mr): Titles can provide insight into age, gender, and social status, which might affect survival chances.
  - Cabin Deck: The deck could correlate with proximity to lifeboats and thus with survival rates.
  - Cabin Assigned: Passengers without a recorded cabin might have different survival probabilities than those with one.
  - Age Group: Different age groups might have had different survival probabilities.
  - Fare Price Groups: Grouping fares into bands can capture non-linear relationships between fare and survival.
  - Name Length: Especially in the early 1900s, a longer name could indicate importance, which may have affected survival chances.
- Preprocessing (see the pipeline sketch after this list)
  - Dealing With Nulls
  - Split the Data
  - Create Pipelines + Transform Columns
- Visualize and Understand Data (see the plotting sketch after this list)
  - Histogram
  - KDE
  - Pie Chart
  - Heatmap
- Define Models: I created 5 models (see the tuning sketch under the evaluation results)
  - Model 1: Random Forest Classifier
  - Model 2: Logistic Regression
  - Model 3: K-Nearest Neighbours
  - Model 4: XGBoost
  - Model 5: Adaptive Boost
- Create Competition Submission (see the sketch after the competition scores)
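Below is a minimal sketch of the feature-engineering step. It assumes the raw Kaggle columns (`Name`, `SibSp`, `Parch`, `Cabin`, `Age`, `Fare`); the bin edges and the title regex are illustrative choices, not necessarily the notebook's:

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the engineered columns described in the Steps list."""
    df = df.copy()
    # Family size: siblings/spouses + parents/children + the passenger.
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    # Title: the token between the comma and the period, e.g. "Braund, Mr. Owen".
    df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
    # Cabin deck: first letter of the cabin code; "U" (unknown) when missing.
    df["Deck"] = df["Cabin"].str[0].fillna("U")
    # Whether a cabin was recorded at all.
    df["CabinAssigned"] = df["Cabin"].notna().astype(int)
    # Age groups (hypothetical bin edges).
    df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 12, 18, 35, 60, 100],
                            labels=["child", "teen", "adult", "middle", "senior"])
    # Fare bands: quartiles capture non-linear fare effects.
    df["FareBand"] = pd.qcut(df["Fare"], q=4, labels=False, duplicates="drop")
    # Name length as a rough proxy for social standing.
    df["NameLength"] = df["Name"].str.len()
    return df
```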
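The preprocessing step can be sketched with the same scikit-learn pieces listed under Technologies (`SimpleImputer`, `OneHotEncoder`, `ColumnTransformer`, `Pipeline`). The column lists and the 80/20 split are assumptions, though an 80/20 split of the 891 training rows does give the 179-row validation set implied by the correct/incorrect counts below:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Engineered training frame (engineer_features is from the sketch above).
train_df = engineer_features(pd.read_csv("train.csv"))
y = train_df["Survived"]
X = train_df.drop(columns=["Survived"])

numeric_cols = ["Age", "Fare", "FamilySize", "NameLength"]   # assumed
categorical_cols = ["Sex", "Embarked", "Title", "Deck"]      # assumed

# Impute nulls, then scale numerics / one-hot encode categoricals.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Hold out 20% for validation: 891 rows * 0.2 -> 179, matching the
# correct/incorrect counts reported under the evaluation results.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```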
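For the visualization step, here is a compact sketch of the four chart types, with illustrative column choices (`train_df` is the engineered training frame from the sketch above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram: age distribution.
sns.histplot(train_df["Age"].dropna(), ax=axes[0, 0])
axes[0, 0].set_title("Age histogram")

# KDE: fare density split by survival.
sns.kdeplot(data=train_df, x="Fare", hue="Survived", ax=axes[0, 1])
axes[0, 1].set_title("Fare KDE by survival")

# Pie chart: overall survival share.
train_df["Survived"].value_counts().plot.pie(autopct="%1.1f%%", ax=axes[1, 0])
axes[1, 0].set_title("Survival share")

# Heatmap: correlations between numeric features.
sns.heatmap(train_df.select_dtypes("number").corr(), annot=True, ax=axes[1, 1])
axes[1, 1].set_title("Correlation heatmap")

plt.tight_layout()
plt.show()
```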
Results of Model Evaluation
Model 1: Random Forest Classifier
- Best Score: 0.834
- Correct: 138
- Incorrect: 41
Model 2: Logistic Regression
- Best Score: 0.795
- Correct: 141
- Incorrect: 38
Model 3: K-Nearest Neighbours
- Best Score: 0.829
- Correct: 136
- Incorrect: 43
Model 4: XGBoost
- Best Score: 0.803
- Correct: 137
- Incorrect: 42
Model 5: Adaptive Boost
- Best Score: 0.819
- Correct: 137
- Incorrect: 42
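The "Best Score" figures above are consistent with cross-validated grid search, and each correct/incorrect pair sums to 179, i.e. the 20% validation split. Here is a sketch of how one model, the random forest, could be tuned and scored that way, reusing `preprocess` and the split from the preprocessing sketch; the parameter grid is hypothetical:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Full pipeline: preprocessing (from the sketch above) + classifier.
rf = Pipeline([("prep", preprocess),
               ("model", RandomForestClassifier(random_state=42))])

# Hypothetical grid; tune with 5-fold cross-validation on the training split.
grid = GridSearchCV(
    rf,
    {"model__n_estimators": [100, 300], "model__max_depth": [5, 10, None]},
    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("Best Score:", grid.best_score_)

# Correct / incorrect counts on the 179-row validation split.
cm = confusion_matrix(y_val, grid.predict(X_val))
print("Correct:", cm.trace(), "Incorrect:", cm.sum() - cm.trace())
ConfusionMatrixDisplay(cm).plot()
```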
Competition Scores (Best to Worst)
- Model 1: Random Forest Classifier - 0.78229
- Model 3: K-Nearest Neighbours - 0.77990
- Model 5: Adaptive Boost - 0.77751
- Model 2: Logistic Regression - 0.76794
- Model 4: XGBoost - 0.76555
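For the "Create Competition Submission" step, here is a minimal sketch that produces the CSV Kaggle expects (a PassengerId column and a Survived column), reusing the tuned pipeline and `engineer_features` from the sketches above:

```python
import pandas as pd

# Refit the best pipeline on all labelled data, then predict the test set.
grid.best_estimator_.fit(X, y)
test_df = engineer_features(pd.read_csv("test.csv"))

submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": grid.best_estimator_.predict(test_df),
})
submission.to_csv("submission.csv", index=False)
```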
Data
The dataset used in this project is available publicly on Kaggle: https://www.kaggle.com/competitions/titanic/data
Technologies
Python
- pandas, numpy, matplotlib, seaborn
- sklearn (OrdinalEncoder, OneHotEncoder, SimpleImputer, make_column_transformer, ColumnTransformer, Pipeline, LogisticRegression, DecisionTreeClassifier, KNeighborsClassifier, RandomForestClassifier, AdaBoostClassifier, cross_val_score, GridSearchCV, ConfusionMatrixDisplay)
- xgboost