House Price Predictor

Predicting Home Sale Prices with Machine Learning

Link to project repository on GitHub

Goal: build a machine learning model that predicts the sale price of each house. For each Id in the test set, I must predict the value of the SalePrice variable.

Submissions are evaluated on the Root Mean Squared Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price. (Taking logs means that the same relative error on an expensive house and on a cheap house affects the result equally.)
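As a minimal sketch of this metric (the function name `rmsle` and the example prices are illustrative assumptions, not taken from the competition code), the score can be computed like this:

```python
import math


def rmsle(y_true, y_pred):
    """RMSE between log(predicted) and log(observed) prices."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_log_errors = [
        (math.log(pred) - math.log(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# A 10% overestimate costs the same at either end of the price range.
cheap = rmsle([100_000], [110_000])
expensive = rmsle([500_000], [550_000])
print(cheap, expensive)
```

Both calls measure the log of the ratio 1.1, which is why the dollar size of the house drops out of the score.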

Steps

  1. Select Features
    • There are a lot of features to choose from (80), so I focused on the ones I expected to have the most significant impact on house price. These were features related to:
      • Quality and Condition
      • Size and Space
      • Year Built and Remodel Date
      • Location
      • Amenities
  2. Visualize and Understand Data
    • Histograms
    • Correlation Heatmap
  3. Preprocessing
    • Dealing with skewed data
    • Null Values
    • Split the Data (into train_X, val_X, train_y, val_y)
    • Dealing with Categorical Features
      • Ordinal Encoding
      • OneHot Encoding
      • Target Encoding
      • Deciding if a feature was worth keeping. This included:
        • ANOVA (Analysis of Variance) Testing
        • Boxplot
        • Bar chart
  4. Define Model
    • I created 3 models in total
      • Model 1: Random Forest Regressor
      • Model 2: Random Forest Regressor + parameters from Cross-Validation
      • Model 3: XGBoost + parameters from Cross-Validation
    • I created Models 2 and 3 in steps 6 and 7, respectively. In this step I only created Model 1.
  5. Evaluate Model
    • Mean Absolute Error (MAE)
    • Mean Squared Error (MSE)
    • Root Mean Squared Error (RMSE)
    • R² Score (R-Squared) (coefficient of determination)
    • Created a visualization of the model (shown at the top of the README)
  6. Cross-Validation
    • Performed grid search for hyperparameter tuning
    • Evaluated the best parameters and model performance using the same metrics as step 5
    • Created a visualization of the model (shown at the top of the README)
  7. XGBoost (with Cross-Validation)
    • Performed grid search for hyperparameter tuning
    • Evaluated the best parameters and model performance using the same metrics as step 5
    • Created a visualization of the model (shown at the top of the README)
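The steps above can be sketched end to end roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic (only the column names mirror the Kaggle dataset), the parameter grid is a made-up example, and only Models 1 and 2 are shown. Model 3 would follow the same grid-search pattern with an XGBoost regressor in place of the random forest.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the housing data (values are made up).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n),
    "GrLivArea": rng.integers(600, 4000, n).astype(float),
    "YearBuilt": rng.integers(1900, 2011, n),
    "Neighborhood": rng.choice(["NAmes", "CollgCr", "OldTown"], n),
})
area_effect = X["Neighborhood"].map({"NAmes": 0, "CollgCr": 30_000, "OldTown": -20_000})
y = (20_000 + 20_000 * X["OverallQual"] + 50 * X["GrLivArea"]
     + area_effect + rng.normal(0, 10_000, n))
X.loc[::25, "GrLivArea"] = np.nan  # inject a few missing values

# Step 3: preprocessing.
y_log = np.log1p(y)  # reduce right skew in the target
X["GrLivArea"] = X["GrLivArea"].fillna(X["GrLivArea"].median())  # null values

# ANOVA: does the mean log price differ across neighborhoods?
groups = [y_log[X["Neighborhood"] == name] for name in X["Neighborhood"].unique()]
f_stat, p_value = f_oneway(*groups)

X = pd.get_dummies(X, columns=["Neighborhood"])  # one-hot encoding
train_X, val_X, train_y, val_y = train_test_split(X, y_log, random_state=0)

# Step 4: Model 1, a random forest with mostly default settings.
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_1.fit(train_X, train_y)


# Step 5: the four evaluation metrics, computed here on the log scale.
def evaluate(model, features, target):
    preds = model.predict(features)
    mse = mean_squared_error(target, preds)
    return {
        "MAE": mean_absolute_error(target, preds),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(target, preds),
    }


model_1_metrics = evaluate(model_1, val_X, val_y)

# Step 6: Model 2, a grid search with cross-validation (hypothetical grid).
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(train_X, train_y)
model_2_metrics = evaluate(search.best_estimator_, val_X, val_y)

print("ANOVA p-value:", p_value)
print("Model 1:", model_1_metrics)
print("Model 2:", search.best_params_, model_2_metrics)
```

Because the target is log-transformed, the metrics here are in log units; applying `np.expm1` to the predictions converts them back to dollars before computing dollar-scale errors.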

Result of Model Evaluations

SalePrice ranges from $34,900 to $755,000.

Model 1: Random Forest Regressor

Model 2: Random Forest Regressor + Cross-Validation

Model 3: XGBoost + Cross-Validation

Data

Dataset: https://www.kaggle.com/competitions/home-data-for-ml-course
There are 80 columns in this dataset, but I used only 25. Here are the ones I used:

View My Submission on Kaggle

The link to the code for my submission is here