Predicting Housing Prices Using XGBoost

Group: XGBoost

Susanna Brown (Advisor: Dr. Cohen)

2026-02-01

Introduction

Literature Review

An Optimal House Price Prediction Algorithm: XGBoost (Sharma, Harsora, and Ogunleye 2024)

  • Data: Ames, Iowa housing data with 2,930 records & 82 features
  • Goal: Predict home prices & identify the most influential predictors using a regression approach
  • Importance: Economic impact on consumer spending, borrowing capacity, investments, real estate markets, & lenders
  • Analysis:
    • Models: Linear regression, MLP, random forest, SVR, & XGBoost
    • Metrics: \(R^2\), adjusted \(R^2\), MSE, RMSE, & cross-validation metrics
    • Hyperparameter tuning (GridSearchCV) & feature selection used
  • Results: XGBoost was the best-fitting model
  • Limitations: A single dataset from one city, unknown reliability, & possible unmodeled infrastructure influence (parks, hospitals, & transportation)
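
The tuning-and-evaluation loop this paper describes (grid search with cross-validated regression metrics) can be sketched in Python. This is only an illustration of the workflow, not the paper's implementation: scikit-learn's `GradientBoostingRegressor` stands in for XGBoost (the `xgboost` package is not assumed available), and the synthetic data, grid values, and split are made up rather than taken from the Ames data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Ames housing data (2,930 x 82 in the paper).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small illustrative grid over common boosting hyperparameters, scored by CV R^2.
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [2, 3]},
    scoring="r2",
    cv=3,
)
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
# Adjusted R^2 penalizes R^2 for the number of predictors p.
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(grid.best_params_, round(r2, 3), round(rmse, 1), round(adj_r2, 3))
```

With real XGBoost the estimator and grid keys would change (e.g. regularization terms), but the GridSearchCV pattern is the same.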

XGBoost: A Scalable Tree Boosting System (Chen and Guestrin 2016)

  • Data: Source, (observations, features), purpose:
    • Allstate, (10M, 4227), insurance claim classification
    • Higgs Boson, (10M, 28), event classification
    • Yahoo LTRC, (473K, 700), ranking
    • Criteo, (1.7B, 67), ad click-through rate
  • Goal: Build an end-to-end system featuring weighted quantile sketching, a sparsity-aware split-finding algorithm, & cache-aware out-of-core tree learning
  • Importance: XGBoost is widely used, effective, open-source, portable, scalable, & resistant to overfitting
  • Analysis: Explained the underlying equations, focusing primarily on the advantages of XGBoost for real-world data
  • Limitations: Not highlighted
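
One of the system ideas above, the weighted quantile sketch, proposes split candidates at quantiles of a feature weighted per instance (in the paper the weights are the second-order gradients). A simplified, in-memory NumPy sketch of that idea follows; the streaming/mergeable machinery of the real algorithm is omitted, and the uniform weights below are made-up stand-ins for real hessians.

```python
import numpy as np

def weighted_quantile_candidates(values, weights, n_candidates):
    """Pick split candidates at evenly spaced *weighted* quantiles.

    Simplified, exact version of the paper's streaming weighted
    quantile sketch: sort by value, build the weighted CDF, and
    read off interior quantile positions.
    """
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w) / w.sum()                        # weighted CDF
    targets = np.linspace(0, 1, n_candidates + 2)[1:-1]  # interior quantiles
    idx = np.searchsorted(cum, targets)
    return v[np.minimum(idx, len(v) - 1)]

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                  # one feature column
h = rng.uniform(0.1, 1.0, size=1000)       # stand-in for per-instance hessians
cands = weighted_quantile_candidates(x, h, 7)
print(np.round(cands, 2))
```

Heavily weighted instances pull candidate thresholds toward themselves, which is why the paper weights by the hessian rather than treating all rows equally.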

A Comparative Analysis of XGBoost (Bentéjac, Csörgő, and Martínez-Muñoz 2019)

  • Data: 28 datasets from UCI repo
  • Goal: Compare methods, highlight importance of parameter tuning, identify most efficient parameters for XGBoost, & explore alternative grids
  • Importance: Accuracy & speed
  • Analysis: Compared grid-search best-fit parameters, every parameter combination for XGBoost, & default parameters on the training data
    • Methods: Random forest, gradient boosting, XGBoost
    • Metrics: Accuracy, Nemenyi test, speed
  • Results: Tuned XGBoost ranked first on all criteria except accuracy, where it ranked second; default XGBoost ranked worst on everything except speed
  • Limitations: None highlighted beyond those of the specific methods compared
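
The paper's central comparison, default vs. tuned parameters under cross-validation, can be reproduced in miniature. As before this is a hedged sketch, not the paper's setup: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, a synthetic dataset replaces the 28 UCI sets, and the tiny grid below is illustrative (the paper explores much larger ones).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Default parameters, scored by 5-fold CV accuracy.
default_acc = cross_val_score(
    GradientBoostingClassifier(random_state=0), X, y, cv=5
).mean()

# Tuned parameters: grid search over the same CV splits.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.05, 0.1, 0.3], "max_depth": [1, 2, 3]},
    cv=5,
)
grid.fit(X, y)
tuned_acc = grid.best_score_

print(round(default_acc, 3), round(tuned_acc, 3))
```

Because the grid contains the default configuration, the tuned score can never fall below the default score here, which mirrors the paper's point that tuning only ever pays (except in compute time).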

Deep Learning with XGBoost for Real Estate Appraisal (Zhao, Chetty, and Tran 2019)

  • Data: AVA image database for unstructured data; an Australian real-estate website & the Google Maps API for structured data
  • Goal: Determine whether a hybrid model using XGBoost predicted home prices better
  • Importance: Business (lenders, insurance companies, governments) & individuals (real estate buyers/sellers/agents) depend on accuracy
  • Analysis: Aesthetic score from AVA, assessment score of visual features, & price prediction from structured and unstructured data
    • Methods: Hybrid model of CNN, MLP, & XGBoost vs. KNN
    • Metrics: MAPE, MAE
    • Some pre-processing was done to clean data & eliminate low observations; 80/20 split
  • Results: XGBoost performed the best
  • Limitations: AVA scores still based on personal judgement of photographers
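
The fusion idea above, feeding image-derived scores alongside structured attributes into a boosted regressor, can be sketched without any real images. Everything below is synthetic and hypothetical: random vectors stand in for CNN aesthetic/visual scores, `GradientBoostingRegressor` stands in for XGBoost, and the price formula is invented purely to give the model something learnable.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
structured = rng.normal(size=(n, 5))   # e.g. beds, baths, land size (synthetic)
visual = rng.normal(size=(n, 3))       # stand-in for CNN-derived image scores
price = 300 + 40 * structured[:, 0] + 15 * visual[:, 0] + rng.normal(0, 5, n)

# Fuse the two views by simple concatenation, then regress with boosting.
X = np.hstack([structured, visual])
X_tr, X_te, y_tr, y_te = train_test_split(X, price, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

mae = mean_absolute_error(y_te, pred)
mape = np.mean(np.abs((y_te - pred) / y_te)) * 100  # the paper's headline metric
print(round(mae, 1), round(mape, 2))
```

Concatenation is the simplest fusion strategy; the paper's hybrid architecture learns the visual representation jointly, but the evaluation metrics (MAE, MAPE) apply the same way.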

Research and Application of XGBoost in Imbalanced Data (Zhang, Jia, and Shang 2022)

  • Data: 2 sets
    • UCI credit card data predicting delinquency
    • European credit card fraud data from ULB
  • Goal: Better classification predictions in imbalanced data
  • Importance: Optimizing only overall accuracy may misclassify minority-class data, impacting industries like medicine & finance
  • Analysis: Missing data, statistical analysis, standardization, outliers, & feature selection
    • Methods: Created the algorithm “SEB-XGB” using SVM, SMOTE, EasyEnsemble, XGBoost, Bayesian optimization, & 10-fold cross-validation
    • Metrics: AUC & G-mean
    • 70/30 split
  • Results: SEB-XGB performed best
  • Limitations: Binary classification
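
The evaluation side of this paper, AUC plus G-mean on an imbalanced split, is easy to sketch. This does not reproduce SEB-XGB: naive random oversampling below is a crude stand-in for SMOTE, `GradientBoostingClassifier` stands in for XGBoost, and the ~5% positive rate is synthetic rather than the credit-card data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (~5% positives), echoing the fraud setting.
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0.01,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Naive random oversampling of the minority class (stand-in for SMOTE).
rng = np.random.default_rng(0)
minority = np.where(y_tr == 1)[0]
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

clf = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# G-mean = sqrt(sensitivity * specificity): high only when BOTH classes
# are classified well, unlike plain accuracy under imbalance.
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
g_mean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
print(round(auc, 3), round(g_mean, 3))
```

A model that predicts the majority class everywhere scores ~95% accuracy here but a G-mean of 0, which is the paper's motivation for these metrics.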

House Price Prediction using Hedonic Pricing Model and Machine Learning Techniques (Zaki et al. 2022)

  • Data: Boston, MA housing data from Kaggle.com (506 observations, 14 features)
  • Goal: Show that machine learning techniques can predict house prices more accurately, efficiently, & practically
  • Importance: Real estate market, property values to local government decision-making, economic growth, development
  • Analysis: Develop model, estimate price, compare
    • Methods: XGBoost & hedonic
    • Metrics: R-squared
    • 67/33 split
  • Results: XGBoost achieved roughly double the accuracy of the hedonic model
  • Limitations: Doesn’t consider other factors: mortgage, insurance, historic home, risk, etc.
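
A hedonic pricing model is conventionally a linear regression of price on property attributes, so the paper's comparison reduces to linear vs. boosted trees on held-out R². The sketch below uses an invented nonlinear price surface (506 rows and a 67/33 split only to echo the paper's shape) and `GradientBoostingRegressor` in place of XGBoost.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic housing data with a nonlinear price surface, which a linear
# hedonic model cannot fully capture but tree boosting can.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(506, 5))
y = 100 * X[:, 0] * X[:, 1] + 20 * np.sin(6 * X[:, 2]) + rng.normal(0, 2, 506)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

r2_hedonic = r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
r2_boost = r2_score(
    y_te, GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te)
)
print(round(r2_hedonic, 3), round(r2_boost, 3))
```

The gap between the two R² values comes entirely from the interaction and sine terms; on genuinely linear data the hedonic model would be competitive, which is worth keeping in mind when reading the paper's headline result.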

Methods

Analysis and Results

Data Exploration and Visualization

Modeling and Results

Conclusion

References

Bentéjac, Candice, Anna Csörgő, and Gonzalo Martínez-Muñoz. 2019. “A Comparative Analysis of XGBoost.” arXiv preprint arXiv:1911.01914.
Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. ACM.
Sharma, Hemlata, Hitesh Harsora, and Bayode Ogunleye. 2024. “An Optimal House Price Prediction Algorithm: XGBoost.” Analytics 3 (1): 30–45.
Zaki, John, Anand Nayyar, Surjeet Dalal, and Zainab H Ali. 2022. “House Price Prediction Using Hedonic Pricing Model and Machine Learning Techniques.” Concurrency and Computation: Practice and Experience 34 (27): e7342.
Zhang, Ping, Yiqiao Jia, and Youlin Shang. 2022. “Research and Application of XGBoost in Imbalanced Data.” International Journal of Distributed Sensor Networks 18 (6): 15501329221106935.
Zhao, Yun, Girija Chetty, and Dat Tran. 2019. “Deep Learning with XGBoost for Real Estate Appraisal.” In 2019 IEEE Symposium Series on Computational Intelligence (SSCI), 1396–1401. IEEE.