Impact of Categorical Feature Encoding on Machine Learning Based Shear Strength Prediction
Abstract
Recent advances in machine learning (ML) offer powerful tools for predicting the structural integrity of civil infrastructure. A critical yet often overlooked aspect of ML modeling is data preprocessing, particularly the encoding of categorical features. This study examines how different encoding schemes for interface conditions (monolithic, rough, smooth) affect ML predictions of interfacial shear strength. It compares categorical encodings (one-hot and five label-encoding variations) with numerical friction coefficients (monolithic = 1.4, rough = 1.0, smooth = 0.6) across four ML models: eXtreme Gradient Boosting (XGBoost), Random Forest (RF), Support Vector Regression (SVR), and Artificial Neural Network (ANN). Among 28 models, categorical encodings outperform numerical representations in 88% of cases. The ensemble models (RF, XGBoost) prove robust, although RF is more sensitive to encoding variations. SVR and ANN exhibit encoding-dependent performance, with some SVR models achieving 12% higher accuracy than numerical-based models. The findings emphasize the crucial role of encoding choices in ML model performance and support the use of adaptive preprocessing techniques to enhance reliability in structural engineering and beyond.
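The three families of representation compared in the abstract can be illustrated with a minimal sketch. It is not the study's code: the helper names (`one_hot`, `label_mappings`, `encode_column`) and the category ordering are assumptions made for illustration. Only the three interface conditions and the friction coefficients come from the abstract. The abstract does not say which five label assignments were tested, so the sketch simply enumerates all 3! = 6 possible integer labelings.

```python
from itertools import permutations

# Interface conditions named in the abstract; the ordering here is an assumption.
CATEGORIES = ["monolithic", "rough", "smooth"]

# Numerical friction coefficients quoted in the abstract.
FRICTION = {"monolithic": 1.4, "rough": 1.0, "smooth": 0.6}


def one_hot(value):
    """Return a binary indicator vector with one slot per category."""
    return [1 if value == c else 0 for c in CATEGORIES]


def label_mappings():
    """Enumerate every integer labeling of the three categories (3! = 6).

    The study uses five label-encoding variations; which five is not
    stated in the abstract, so all candidates are listed here.
    """
    return [dict(zip(CATEGORIES, p)) for p in permutations(range(len(CATEGORIES)))]


def encode_column(values, scheme, mapping=None):
    """Encode a list of interface conditions as rows of feature values."""
    if scheme == "one_hot":
        return [one_hot(v) for v in values]
    if scheme == "label":
        if mapping is None:
            raise ValueError("label encoding needs a category-to-integer mapping")
        return [[mapping[v]] for v in values]
    if scheme == "friction":
        return [[FRICTION[v]] for v in values]
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    samples = ["rough", "smooth", "monolithic"]  # hypothetical specimens
    print(encode_column(samples, "one_hot"))
    print(encode_column(samples, "label", label_mappings()[0]))
    print(encode_column(samples, "friction"))
```

One design point the sketch makes visible: label encoding and friction coefficients both impose an ordering and a spacing on the categories, whereas one-hot encoding does not, which is one plausible reason model performance can depend on the choice.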
DOI
10.12783/shm2025/37535