Group: XGBoost
The study utilized a Kaggle.com dataset of housing data from Ames, Iowa, composed of 2,930 records with 82 features, to predict house prices. Motivated by other studies, which were narrowly focused on model development and a classification approach (higher or lower than listed price), the authors instead pursued a regression approach, focusing on optimizing the prediction model and identifying the most influential predictors. The paper highlights the economic importance of accurate predictions for consumer spending, borrowing capacity, investment decisions, and real estate operations, including mortgage lending. The study followed a six-stage machine learning (ML) pipeline: (1) data collection, (2) data preprocessing, (3) model training, (4) model tuning, (5) prediction and deployment, and (6) monitoring and maintenance. Analysis began by comparing five methods to find the best-performing model: “…linear regression (LR), multilayer perceptron (MLP), random forest regression (RF), support vector regressor (SVR), and extreme gradient boosting (XGBoost)…” (p. 33). Detailed explanations of each method’s benefits, limitations, and equations were provided. XGBoost was emphasized “…based on its interpretability, simplicity, and performance accuracy” (p. 31), as well as its resistance to overfitting and ability to solve real-world problems efficiently. Initial analysis revealed that XGBoost outperformed the other models on R-squared, adjusted R-squared, mean squared error (MSE), root mean squared error (RMSE), and cross-validation (CV) metrics. Each metric was accompanied by a brief description and equation, satisfying a goal stated in the paper, “to provide a thorough understanding of the metrics used for the model comparisons and the selection of the best-performing model” (p. 38). Similarly, the authors demonstrated the importance of hyperparameter tuning through GridSearchCV, KerasTuner, and RandomSearchCV.
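The comparison metrics the authors list follow standard definitions; a minimal pure-Python sketch, assuming the conventional textbook formulas rather than any paper-specific variant:

```python
import math

def mse(y, y_pred):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y)

def rmse(y, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y, y_pred))

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_pred, n_features):
    """R-squared penalized for the number of predictors p, given n samples."""
    n = len(y)
    r2 = r_squared(y, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
```

Adjusted R-squared matters in a comparison like this one because, with 82 candidate features, plain R-squared can only rise as predictors are added, even when they contribute nothing.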
The model comparisons were reproduced with GridSearchCV hyperparameter tuning, again with XGBoost performing the best. Feature selection was also conducted to reduce dimensionality. Noted limitations included reliance on a single dataset, which covered only one city and had unknown reliability, and the possible influence of infrastructure variables, such as parks, hospitals, and transportation, on housing prices (Sharma, Harsora, and Ogunleye 2024).
In the paper, the authors focus on the advantages of XGBoost as a widely used, effective, open-source, portable, and scalable machine learning method that is resistant to overfitting. They point to its track record in Kaggle.com competitions, citing statistics from 2015 challenges in which XGBoost was used in winning solutions. The goal of the paper was to highlight XGBoost’s optimizations, such as its ability to handle sparse data, its speed, and its scalability. To do this, the authors outline their contributions on page 2:
The paper explains each point with figures and mathematical equations, briefly addressing limitations when applicable, but primarily focusing on the advantages of XGBoost when analyzing large data from four sources. The chosen data were split into training and test sets and ranged in size from 473,000 to 1.7 billion observations and 28 to 4,227 features, across tasks such as insurance claim classification, event classification, ranking, and ad click-through rate prediction, to highlight the efficiency and scalability of XGBoost in real-world applications (Chen 2016).
As the title implies, the article compares random forest, gradient boosting, and eXtreme Gradient Boosting (XGBoost), and highlights the importance of parameter tuning for accuracy and speed. The authors also pursued additional goals, including identifying the most efficient combination of parameters for XGBoost and exploring alternative grids for the methods. They go into depth about each method, including its advantages and disadvantages, as well as the attributes that were adjusted. To keep the analysis consistent and fair, they explain why they did not use XGBoost’s built-in handling of missing values, and they detail the computer processor used to run the models. The analysis used 28 datasets of varying size (and missing values) from the UCI repository, spanning different application fields. To compare the methods, a stratified 10-fold cross-validation with grid search was first conducted on the training set to select the best parameters. To find a more optimized parameter combination for XGBoost, the training set was evaluated against every possible combination of five parameters: learning_rate, gamma, max_depth, colsample_bylevel, and subsample. Next, the default parameters for each of the three methods were compared. The analysis concluded by calculating the generalization accuracy of the grid-search, default, and various XGBoost configurations against the test data. Results across the 28 datasets showed XGBoost achieved the second-highest accuracy when using optimized (tuned) parameters, but the worst when using default parameters. Rankings determined using the Nemenyi test were similar; XGBoost with optimized parameters achieved the highest average rank. Finally, XGBoost also ran the fastest, not counting grid search time (because the grid sizes differ across the methods).
Overall, the results effectively highlighted how well tuned versus default configurations perform depending on the characteristics of the dataset, including noise, missing values, and susceptibility to overfitting (Bentéjac, Csörgő, and Martínez-Muñoz 2019).
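The exhaustive search over the five XGBoost parameters can be sketched as a plain enumeration of the Cartesian product of candidate values. The values below are illustrative placeholders, not the grid from the paper, and a stand-in scoring function replaces the stratified 10-fold cross-validated accuracy:

```python
import itertools

# Illustrative candidate values; the paper's actual grid differs.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.3],
    "gamma": [0, 1],
    "max_depth": [3, 6, 9],
    "colsample_bylevel": [0.5, 1.0],
    "subsample": [0.5, 1.0],
}

# Every possible combination of the five parameters (3*2*3*2*2 = 72 here).
combos = [dict(zip(param_grid, values))
          for values in itertools.product(*param_grid.values())]

def score(params):
    """Stand-in for the stratified 10-fold CV accuracy on the training set."""
    return -abs(params["learning_rate"] - 0.1) - 0.01 * params["gamma"]

# Grid search = pick the combination with the best cross-validated score.
best = max(combos, key=score)
```

This enumeration also makes the speed caveat in the results concrete: the wall-clock cost of grid search grows multiplicatively with the number of candidate values per parameter, which is why timings that include grid search are not comparable across methods with different grid sizes.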
In the paper, the authors strive to use both unstructured and structured data to accurately predict real estate prices with a hybrid deep learning model that includes XGBoost as its topmost layer. Accuracy in real estate prices affects many businesses and individuals – real estate agents, home buyers and sellers, insurance companies, lenders, and governments – which served as the authors’ motivation. The authors explain how XGBoost is often used in data science for image classification tasks such as social media and image aesthetics, which led to their data source. The unstructured data, typically high-quality property images taken by photographers, was sourced from AVA, a large database for aesthetic visual analysis; it was combined with structured, factual data from an online Australian real estate website and geographical coordinates from the Google Maps API. Some preprocessing was done to eliminate properties with too few photos and suburbs with too few observations, and the data was split 80%/20% into training and test sets. To conduct the analysis, an aesthetic score from 1 to 10 (10 being the most aesthetically pleasing) was first assigned automatically to each chosen property image based on its AVA score. AVA scores were predetermined from the personal judgement of self-identified photographers’ votes – a subjective element that sits uneasily with the very limitation the authors sought to combat, namely the influence of appraisers’ or real estate agents’ personal judgement on home prices. Regardless, once the scores were obtained, four images from each property were cropped and combined into a single larger image to extract visual content. Next, a mean quality assessment score was assigned from the results of a hybrid model consisting of a convolutional neural network (CNN), an MLP, and XGBoost, among other layers detailed in the text. Finally, these assessment scores were combined with the structured data to predict price.
Evaluated against mean absolute error (MAE) and mean absolute percentage error (MAPE), XGBoost was found to be the most accurate model (compared to the K-Nearest Neighbors (KNN) algorithm) in determining housing prices from the structured and unstructured data (Zhao, Chetty, and Tran 2019).
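The two evaluation metrics reduce to short formulas; a minimal pure-Python sketch using the standard definitions (MAPE expressed as a percentage, and undefined when a target price is zero):

```python
def mae(y, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(a - b) for a, b in zip(y, y_pred)) / len(y)

def mape(y, y_pred):
    """Mean absolute percentage error; assumes no zero-valued targets."""
    return 100 * sum(abs(a - b) / abs(a) for a, b in zip(y, y_pred)) / len(y)
```

For example, true prices of 100 and 200 (in thousands) predicted as 110 and 180 give an MAE of 15 and a MAPE of 10%. MAPE is a natural fit for housing data because it normalizes errors across cheap and expensive properties.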
The article shows that although XGBoost is a strong ensemble method, it may not be the best choice for handling imbalanced categorical data. The authors highlight the importance of such data in fields utilizing artificial intelligence (AI) and machine learning (ML), such as medicine and finance, where, for example, non-fraudulent charges may significantly outnumber fraudulent ones. The goal of the paper was to overcome typical accuracy-focused evaluation methods that misclassify imbalanced classes and instead achieve a “…higher recognition rate and better classification prediction effect.” (p. 9). The data, which underwent a 70/30 split, was sourced from two datasets: (1) Taiwan credit card data from the UCI website, which included one categorical variable predicting whether the member would become delinquent, and (2) credit card fraud data from the ULB European machine learning labs, with a binomial category of fraud or not. The analysis encompassed three phases. First, the authors looked for missing values, performed statistical analysis, and conducted standardization. Next, outliers were examined and combined with nearby classes to locate and reduce imbalance. Finally, feature selection was performed. To conduct the analysis, an algorithm was created consisting of the support vector machine (SVM) variant of the synthetic minority over-sampling technique (SMOTE), referred to as “SVM-SMOTE”, along with EasyEnsemble, XGBoost, Bayesian parameter optimization, and 10-fold cross-validation. The algorithm was referred to, in shorthand, as “SEB-XGB”. Comparisons were first made among XGBoost (with default parameters), SEB-XGB, and the intermediate combinations, then against other models: RUSBoost, CatBoost, LightGBM, and EBB-XGBoost. The article discussed the methods used in the analysis in depth, in addition to the computing setup, the packages used, and their versions.
Evaluated on area under the curve (AUC) and G-mean values to determine feasibility and effectiveness, the SEB-XGB model performed better than XGBoost and the additional combinations/models. However, the authors did acknowledge a limitation of the analysis: the classification features in the data were binary, which could have influenced the results (Zhang, Jia, and Shang 2022).
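The G-mean used for this evaluation is the geometric mean of sensitivity and specificity, which is why it rewards balanced performance across both classes where plain accuracy does not; a minimal sketch from confusion-matrix counts:

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (TPR) and specificity (TNR).

    A model that ignores the minority class scores zero here even
    if its plain accuracy is high, which is the point of the metric.
    """
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    return math.sqrt(sensitivity * specificity)
```

For instance, with 10 fraudulent and 100 legitimate transactions, a classifier scoring tp=8, fn=2, tn=90, fp=10 has sensitivity 0.8 and specificity 0.9, so its G-mean is about 0.849; predicting “not fraud” for everything would reach roughly 91% accuracy but a G-mean of 0.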
The study is motivated by the importance of real estate markets and property values to local governments’ decision-making, economic growth, and development. To avoid events such as the 2008 recession, new machine learning techniques (MLTs) can predict property prices more accurately, efficiently, and practically than current models, and can help prevent overinflated property values. To test this, data sourced from Kaggle.com, consisting of 506 observations and 14 features for Boston, MA, was first processed by removing noisy data, performing correlation analysis, and encoding features. Then, with a 67/33 split, the authors developed a model, estimated housing prices, and finally compared XGBoost and hedonic models using several evaluation metrics, including RMSE, recall, precision, f-measure, and sensitivity, with the final comparison based on the R-squared score. XGBoost was found to produce double the accuracy of the hedonic model. A limitation of the study was that other adjustments (mortgage costs, insurance, historic property valuation, risk, etc.) were not considered (Zaki et al. 2022).
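The 67/33 split is a routine preprocessing step; a minimal stdlib sketch, seeded for reproducibility (in practice a library helper such as scikit-learn's train_test_split would be used, and the paper does not specify its exact procedure):

```python
import random

def train_test_split(rows, test_frac=0.33, seed=42):
    """Shuffle row indices and carve off test_frac of them as the test set."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_frac)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

# With the Boston dataset's 506 observations, a 33% test fraction
# yields 167 test rows and 339 training rows.
train, test = train_test_split(list(range(506)))
```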