In this project I scraped the web for information about boats listed for sale online. As a potential customer who might buy a sailing boat one day, my goal was to see whether I could mine useful information from the data to support my decision making. On top of this data I also built machine learning models, using linear regression and ANNs, that can predict a boat's price. A large portion of the work went into scraping the correct information and cleaning the data afterwards.
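The scraping code isn't reproduced in this summary; as one illustration of the kind of loop involved, here is a minimal requests + BeautifulSoup sketch. The URL and CSS selectors are hypothetical placeholders, since the actual listing site and its page structure aren't named here:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical listing page; the real site and selectors differ.
URL = "https://example-boat-listings.com/sailboats?page=1"

response = requests.get(URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

boats = []
for card in soup.select("div.listing-card"):  # placeholder selector
    title = card.select_one("h2.title")
    price = card.select_one("span.price")
    if title and price:
        boats.append({
            "title": title.get_text(strip=True),
            "price": price.get_text(strip=True),
        })
```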
Picture 1: Most popular boat brands
Picture 2: Most popular types of boat
Picture 3: Boat price by material
Picture 4: Linear regression model
Model | MAE | MSE |
---|---|---|
Linear Regression | 45667.50 | 5200817068.73 |
ANN (Input: 121, Hidden: 121, Output: 1) | 34381.004 | 9830265000.0 |
ANN (Input: 121, Hidden: 121, Hidden: 121, Output: 1) | 34995.54 | 14412964000.0 |
ANN (Input: 121, Hidden: 242, Output: 1) | 34628.77 | 10111264000.0 |
ANN (Basic model with SGD) | 45051.16 | 8585146000.0 |
As expected, the ANNs turned out to be more powerful than linear regression for predicting boat prices, but even with hyperparameter tuning no model could get below an MAE of roughly 34,000. Interestingly, the ANNs' MSE was higher than linear regression's despite the lower MAE, suggesting they occasionally make very large errors on expensive boats. Doing more feature engineering and reducing the number of features would be the next step in model optimization.
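For reference, a minimal Keras sketch of the single-hidden-layer architecture from the table (Input: 121, Hidden: 121, Output: 1). The ReLU activation and Adam optimizer are assumptions; the table only notes that the baseline variant used SGD:

```python
from tensorflow import keras
from tensorflow.keras import layers

# 121 input features after encoding, one hidden layer of 121 units,
# and a single output neuron for the predicted price.
model = keras.Sequential([
    keras.Input(shape=(121,)),
    layers.Dense(121, activation="relu"),  # hidden layer (activation assumed)
    layers.Dense(1),                       # linear output for regression
])

# Adam is an assumption; the "basic model" row in the table used SGD instead.
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# X_train, y_train: the cleaned feature matrix and boat prices (not shown here)
# model.fit(X_train, y_train, validation_split=0.2, epochs=50, batch_size=32)
```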
The Spaceship Titanic dataset is a Kaggle competition dataset based on the famous original Titanic dataset, this time set in the future. The goal of the project was to explore which features most influence a person's chance of being transported and to create the best possible model to predict it.
Through data exploration and data visualization I found that the following give a person a better chance of being transported on the spaceship:
- Being from the planet Europa
- Being in CryoSleep
- Traveling to TRAPPIST-1e or 55 Cancri e
- Being of younger age
- Being richer / spending more
Graph 1: There is a higher chance of being transported if you are from the planet Europa
Graph 2: Travelling to TRAPPIST-1e or 55 Cancri e gives you a higher chance of being transported
Graph 3: The age distribution of people who were or were not transported
Graph 4: Correlation heatmap
Model | Accuracy |
---|---|
Logistic Regression | 0.7779 |
Random Forest | 0.8033 |
SVM | 0.7612 |
XGB Classifier | 0.8039 |
ANN | 0.8016 |
All models achieved very similar scores of around 80%, the best being the XGB Classifier by a small margin. I was not able to push much past 80% even with optimization; the one thing that might still help would be reducing the total number of features in the final dataset.
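A minimal sketch of how such a comparison can be set up with scikit-learn and XGBoost. Here X and y stand for the preprocessed features and the Transported labels, and the hyperparameters shown are assumptions rather than the ones actually used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from xgboost import XGBClassifier

def compare_models(X, y):
    """Print 5-fold cross-validated accuracy for each candidate model."""
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(random_state=42),
        "SVM": SVC(),
        "XGB Classifier": XGBClassifier(eval_metric="logloss"),
    }
    for name, clf in models.items():
        acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
        print(f"{name}: {acc:.4f}")
```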
In this project I used a Kaggle dataset to create a classification model that could predict the quality of a wine from its measured features.
Graph 1: The distributions of the features were generally similar
Graph 2: But there were still a lot of outliers that needed to be cleaned (one approach is sketched below)
Graph 3: Correlation heatmap
Graph 4: Pie chart showing the imbalance in the labels
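As Graph 2 suggests, several features contained heavy outliers. One common cleaning approach is IQR filtering; this is a sketch of that idea, not necessarily the method used in the notebook (df stands for the wine DataFrame):

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, cols, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where any of `cols` falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# df is assumed to be the wine DataFrame; k=1.5 is the conventional threshold.
# cleaned = remove_iqr_outliers(df, df.select_dtypes("number").columns)
```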
Model | Accuracy |
---|---|
Decision Tree | 0.7701 |
Random Forest | 0.8551 |
KNN | 0.8367 |
XGB Classifier | 0.8459 |
In the end the Random Forest Classifier proved the most successful, with an accuracy score of 0.86. The next best model was the XGB Classifier with an accuracy score of 0.84. The KNN model with only 1 neighbour showed a good accuracy score, but such a model has high variance and would likely overfit, so it isn't of much use to us.
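The k=1 pitfall is easy to probe: cross-validated accuracy across several neighbour counts tends to expose how well a single-neighbour model actually generalises. A small sketch, assuming X and y are the cleaned wine features and quality labels:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy_by_k(X, y, ks=(1, 3, 5, 7, 9)):
    """Print 5-fold cross-validated accuracy for several neighbour counts."""
    for k in ks:
        knn = KNeighborsClassifier(n_neighbors=k)
        acc = cross_val_score(knn, X, y, cv=5, scoring="accuracy").mean()
        print(f"k={k}: {acc:.4f}")
```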
In this project I analysed the Store Sales Kaggle competition dataset. It consists of several files: train, test, stores, oil prices and holidays. The goal of the project was to analyse and extract various insights and to create a machine learning model that could predict store sales.
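Joining the files is the first practical step. A sketch of the joins, assuming the standard column names from the competition files (store_nbr, date, dcoilwtico); the notebook's actual preprocessing may differ:

```python
import pandas as pd

# File and column names follow the Store Sales competition layout.
train = pd.read_csv("train.csv", parse_dates=["date"])
stores = pd.read_csv("stores.csv")
oil = pd.read_csv("oil.csv", parse_dates=["date"])
holidays = pd.read_csv("holidays_events.csv", parse_dates=["date"])

df = (
    train.merge(stores, on="store_nbr", how="left")
         .merge(oil, on="date", how="left")
         .merge(holidays, on="date", how="left", suffixes=("", "_holiday"))
)
df["dcoilwtico"] = df["dcoilwtico"].ffill()  # oil prices have gaps on weekends
```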
Graph 1: Sales per month
Graph 2: Sales by store family
Graph 3: Sales by state
Graph 4: Linear model visualized
Metric | XGB Results |
---|---|
Accuracy | 0.9364 |
MAE | 95.1703 |
MSE | 76950.7888 |
RMSE | 277.4001 |
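The MAE/MSE/RMSE rows can be reproduced from the held-out predictions as below (the "Accuracy" row is presumably the regressor's default R² score). A sketch assuming y_test and y_pred come from the fitted XGB model:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def report_regression_metrics(y_true, y_pred):
    """Print the error metrics shown in the results table."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    print(f"MAE: {mae:.4f}  MSE: {mse:.4f}  RMSE: {rmse:.4f}")
```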