Machine Learning - Titanic

  

Problem:

Sinking of the Titanic | National Geographic Society

 

Titanic - Machine Learning from Disaster
The following link has the detailed problem description:
https://www.kaggle.com/c/titanic/overview/description

Observation: 

First, I checked for missing values in the columns of the train_data and test_data datasets by looking at the count of non-null values in each column.
The Age column has missing values in both train_data and test_data, and the Fare column has missing values in test_data.
Second, the Cabin and Name columns are not used in predicting passenger survival.
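The missing-value check described above can be sketched roughly as follows. This is only a minimal illustration, not the code used for this post: the small DataFrame stands in for train_data (which would normally be loaded from the Kaggle train.csv), and the helper name missing_counts is hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train_data (assumption); the real data comes from Kaggle's train.csv.
train_data = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 38.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Cabin": [None, "C85", None, "C123"],
})

def missing_counts(df):
    """Return the number of missing values per column, keeping only columns with gaps."""
    counts = df.isna().sum()
    return counts[counts > 0]

# count() gives the non-null values per column; a count below len(df) means gaps.
print(train_data.count())

missing = missing_counts(train_data)
print(missing)
```

Running the same helper on test_data would show which columns have gaps there.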

Analysis:

Based on these observations, I filled the missing values in the Age and Fare columns with the column median. I dropped the Cabin and Name columns from train_data and test_data. A boxplot also showed outliers in the Age column of train_data, so I removed the rows containing those outlier values from that dataset.
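A rough sketch of these three cleaning steps is given here, under a few assumptions. The outlier rule is assumed to be the usual boxplot convention (values more than 1.5 times the interquartile range beyond the quartiles), since the exact rule is not stated above. The helper names fill_with_median and drop_iqr_outliers are hypothetical, and the small DataFrame is made-up stand-in data.

```python
import numpy as np
import pandas as pd

def fill_with_median(df, columns):
    """Return a copy of df with missing values in each column replaced by that column's median."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out

def drop_iqr_outliers(df, column, k=1.5):
    """Keep only rows whose value lies within k * IQR of the quartiles (boxplot whisker rule, assumed)."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

# Made-up stand-in for train_data (assumption).
train_data = pd.DataFrame({
    "Name": list("ABCDEFGH"),
    "Age": [22.0, np.nan, 38.0, 26.0, 35.0, 30.0, 28.0, 80.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 30.00],
    "Cabin": [None] * 8,
})

filled = fill_with_median(train_data, ["Age"])     # step 1: median for missing values
trimmed = filled.drop(columns=["Cabin", "Name"])   # step 2: drop the unused columns
cleaned = drop_iqr_outliers(trimmed, "Age")        # step 3: remove Age outlier rows

print(cleaned)
```

For test_data, the same fill would be applied to both Age and Fare, and no rows would be removed, because a prediction is needed for every test passenger.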






Code:











Outcome:

I made the changes described above after first training a Random Forest Classifier as per the instructions given, which scored 0.77511. After filling the missing values, dropping the unused columns, and removing the outlier rows from the dataset, the final score was 0.79186. The screenshot below shows the changes in score.
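For readers who want to try something similar, here is a minimal sketch of fitting a Random Forest Classifier with scikit-learn. The feature list, the hyperparameters (n_estimators, max_depth, random_state), and the tiny made-up DataFrame are all assumptions for illustration; they are not the settings behind the scores above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up stand-in for the cleaned train_data (assumption).
train_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 3, 1, 2, 3, 2],
    "Sex": ["male", "female", "female", "male", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 30.0, 28.0, 54.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 13.00, 8.46, 30.07],
})

# Assumed feature set; get_dummies one-hot encodes the text column Sex.
features = ["Pclass", "Sex", "Age", "Fare"]
X = pd.get_dummies(train_data[features])
y = train_data["Survived"]

# Assumed hyperparameters; random_state fixes the seed so runs are repeatable.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)

# Predicting on the training rows only shows the call shape; it is not a real evaluation.
predictions = model.predict(X)
print(predictions)
```

With the real data, the test_data features would be encoded the same way and aligned to the training columns, for example with reindex(columns=X.columns, fill_value=0), before calling predict.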



Conclusion:

Replacing missing values with the median can improve the performance of a model.
Dropping columns that are not used in the analysis also helps improve the model.
Removing rows with outlier values also helps improve the model's performance.












