In this project I scraped the web for information about boats listed for sale online. As a potential customer who might buy a sailing boat one day, my goal was to see whether I could mine useful information from the data to support my decision making. On top of this data I also built machine learning models, using linear regression and ANNs, that can predict a boat's price. A large portion of the work went into scraping the correct information and cleaning the data afterwards.
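The scraping code isn't reproduced in this summary; as one illustration of the kind of loop involved, here is a minimal requests + BeautifulSoup sketch. The URL and CSS selectors are hypothetical placeholders, since the actual listing site and its page structure aren't named here:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical listing page; the real site and selectors differ.
URL = "https://example-boat-listings.com/sailboats?page=1"

response = requests.get(URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

boats = []
for card in soup.select("div.listing-card"):  # placeholder selector
    title = card.select_one("h2.title")
    price = card.select_one("span.price")
    if title and price:
        boats.append({
            "title": title.get_text(strip=True),
            "price": price.get_text(strip=True),
        })
```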
Picture 1: Most popular boat brands
Picture 2: Most popular types of boat
Picture 3: Boat price by material
Picture 4: Linear regression model
Model | MAE | MSE |
---|---|---|
Linear Regression | 45667.50 | 5200817068.73 |
ANN (Input: 121, Hidden: 121, Output: 1) | 34381.004 | 9830265000.0 |
ANN (Input: 121, Hidden: 121, Hidden: 121, Output: 1) | 34995.54 | 14412964000.0 |
ANN (Input: 121, Hidden: 242, Output: 1) | 34628.77 | 10111264000.0 |
ANN (Basic model with SGD) | 45051.16 | 8585146000.0 |
As expected, the ANNs turned out to be more powerful than linear regression for predicting boat prices, but even with hyperparameter tuning no model could get below an MAE of roughly 34,000. Interestingly, the ANNs' MSE was higher than linear regression's despite the lower MAE, suggesting they occasionally make very large errors on expensive boats. Doing more feature engineering and reducing the number of features would be the next step in model optimization.
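For reference, a minimal Keras sketch of the single-hidden-layer architecture from the table (Input: 121, Hidden: 121, Output: 1). The ReLU activation and Adam optimizer are assumptions; the table only notes that the baseline variant used SGD:

```python
from tensorflow import keras
from tensorflow.keras import layers

# 121 input features after encoding, one hidden layer of 121 units,
# and a single output neuron for the predicted price.
model = keras.Sequential([
    keras.Input(shape=(121,)),
    layers.Dense(121, activation="relu"),  # hidden layer (activation assumed)
    layers.Dense(1),                       # linear output for regression
])

# Adam is an assumption; the "basic model" row in the table used SGD instead.
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# X_train, y_train: the cleaned feature matrix and boat prices (not shown here)
# model.fit(X_train, y_train, validation_split=0.2, epochs=50, batch_size=32)
```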
The Spaceship Titanic dataset is a Kaggle competition dataset based on the famous original Titanic dataset, this time set in the future. The goal of the project was to explore which features most influence a person's chance of being transported and to create the best possible model to predict it.
Through data exploration and data visualization I found that the following give a person a better chance of being transported on the spaceship:
- Being from the planet Europa
- Being in CryoSleep
- Traveling to TRAPPIST-1e or 55 Cancri e
- Being of younger age
- Being richer / spending more
Graph 1: There is a higher chance of being transported if you are from the planet Europa
Graph 2: Travelling to TRAPPIST-1e or 55 Cancri e gives you a higher chance of being transported
Graph 3: The age distribution of people who were or were not transported
Graph 4: Correlation heatmap
Model | Accuracy |
---|---|
Logistic Regression | 0.7779 |
Random Forest | 0.8033 |
SVM | 0.7612 |
XGB Classifier | 0.8039 |
ANN | 0.8016 |
All models achieved very similar scores of around 80%, the best being the XGB Classifier by a small margin. I was not able to push much past 80% even with optimization; the one thing that might still help would be reducing the total number of features in the final dataset.
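A minimal sketch of how such a comparison can be set up with scikit-learn and XGBoost. Here X and y stand for the preprocessed features and the Transported labels, and the hyperparameters shown are assumptions rather than the ones actually used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from xgboost import XGBClassifier

def compare_models(X, y):
    """Print 5-fold cross-validated accuracy for each candidate model."""
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(random_state=42),
        "SVM": SVC(),
        "XGB Classifier": XGBClassifier(eval_metric="logloss"),
    }
    for name, clf in models.items():
        acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
        print(f"{name}: {acc:.4f}")
```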
In this project I used a Kaggle dataset to create a classification model that could predict the quality of a wine from its measured features.
Graph 1: The distributions of the features were generally similar
Graph 2: But there were still a lot of outliers that needed to be cleaned (one approach is sketched below)
Graph 3: Correlation heatmap
Graph 4: Pie chart showing the imbalance in the labels
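As Graph 2 suggests, several features contained heavy outliers. One common cleaning approach is IQR filtering; this is a sketch of that idea, not necessarily the method used in the notebook (df stands for the wine DataFrame):

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, cols, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where any of `cols` falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# df is assumed to be the wine DataFrame; k=1.5 is the conventional threshold.
# cleaned = remove_iqr_outliers(df, df.select_dtypes("number").columns)
```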
Model | Accuracy |
---|---|
Decision Tree | 0.7701 |
Random Forest | 0.8551 |
KNN | 0.8367 |
XGB Classifier | 0.8459 |
In the end the Random Forest Classifier proved the most successful, with an accuracy score of 0.86. The next best model was the XGB Classifier with an accuracy score of 0.84. The KNN model with only 1 neighbour showed a good accuracy score, but such a model has high variance and would likely overfit, so it isn't of much use to us.
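The k=1 pitfall is easy to probe: cross-validated accuracy across several neighbour counts tends to expose how well a single-neighbour model actually generalises. A small sketch, assuming X and y are the cleaned wine features and quality labels:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy_by_k(X, y, ks=(1, 3, 5, 7, 9)):
    """Print 5-fold cross-validated accuracy for several neighbour counts."""
    for k in ks:
        knn = KNeighborsClassifier(n_neighbors=k)
        acc = cross_val_score(knn, X, y, cv=5, scoring="accuracy").mean()
        print(f"k={k}: {acc:.4f}")
```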
In this project I analysed the Store Sales Kaggle competition dataset. It consists of several files: train, test, stores, oil prices and holidays. The goal of the project was to analyse and extract various insights and to create a machine learning model that could predict store sales.
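Joining the files is the first practical step. A sketch of the joins, assuming the standard column names from the competition files (store_nbr, date, dcoilwtico); the notebook's actual preprocessing may differ:

```python
import pandas as pd

# File and column names follow the Store Sales competition layout.
train = pd.read_csv("train.csv", parse_dates=["date"])
stores = pd.read_csv("stores.csv")
oil = pd.read_csv("oil.csv", parse_dates=["date"])
holidays = pd.read_csv("holidays_events.csv", parse_dates=["date"])

df = (
    train.merge(stores, on="store_nbr", how="left")
         .merge(oil, on="date", how="left")
         .merge(holidays, on="date", how="left", suffixes=("", "_holiday"))
)
df["dcoilwtico"] = df["dcoilwtico"].ffill()  # oil prices have gaps on weekends
```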
Graph 1: Sales per month
Graph 2: Sales by store family
Graph 3: Sales by state
Graph 4: Linear model visualized
Metric | XGB Results |
---|---|
Accuracy | 0.9364 |
MAE | 95.1703 |
MSE | 76950.7888 |
RMSE | 277.4001 |
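The MAE/MSE/RMSE rows can be reproduced from the held-out predictions as below (the "Accuracy" row is presumably the regressor's default R² score). A sketch assuming y_test and y_pred come from the fitted XGB model:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def report_regression_metrics(y_true, y_pred):
    """Print the error metrics shown in the results table."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    print(f"MAE: {mae:.4f}  MSE: {mse:.4f}  RMSE: {rmse:.4f}")
```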