This project is part of the Kaggle competition "House Prices: Advanced Regression Techniques" where I developed machine learning models to predict house prices based on 79 explanatory variables. The dataset describes residential properties in Ames, Iowa, with features ranging from lot size and quality ratings to specific details like basement condition and roof material.
Rather than focusing solely on competition rankings, this project demonstrates a complete machine learning workflow from exploratory data analysis and preprocessing to model training, evaluation, and deployment. Each step is thoroughly documented to highlight the educational aspects and decision-making process behind building an effective prediction system.
The project emphasizes feature engineering techniques, statistical transformations to handle data distributions, and systematic model selection. A core focus is creating a robust preprocessing pipeline that can reliably handle new data while effectively addressing the challenges of missing values, categorical variables, and feature interactions.
The dataset contained numerous missing values that required different handling approaches depending on their meaning (a feature genuinely absent from the property vs. a value that simply went unrecorded).
Developed a strategic approach that distinguished between different types of missing values. For structural features where NA meant the feature didn't exist (e.g., no pool), encoded missing values as a separate category or zero. For features where values were truly missing, used advanced imputation techniques based on related variables, creating a comprehensive missing data pipeline that preserved the information content.
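This two-track strategy can be sketched as follows. The column names follow the Ames data dictionary, but the helper function and the exact column lists are illustrative, not the project's actual code:

```python
import numpy as np
import pandas as pd

def handle_missing(df):
    """Distinguish structural absence (NA = feature not present) from true missingness."""
    df = df.copy()
    # Structural NAs: the house simply lacks the feature, so "missing" is a
    # meaningful category of its own.
    for col in ["PoolQC", "Alley", "Fence", "FireplaceQu"]:
        if col in df:
            df[col] = df[col].fillna("None")
    # Numeric counterparts of absent features become zero (no garage -> 0 sq ft).
    for col in ["GarageArea", "TotalBsmtSF"]:
        if col in df:
            df[col] = df[col].fillna(0)
    # Truly missing values: impute from a related variable, e.g. LotFrontage
    # by the median within each Neighborhood.
    if {"LotFrontage", "Neighborhood"}.issubset(df.columns):
        df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
            lambda s: s.fillna(s.median())
        )
    return df
```

Keeping the function pure (it returns a new frame rather than mutating its input) makes it easy to slot into a larger pipeline.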
Many categorical variables had high cardinality or ordinal relationships that were difficult to encode effectively.
Implemented target-based encoding for high-cardinality variables like neighborhood, carefully avoiding target leakage through cross-validation. For ordinal variables like quality ratings, developed numeric mappings that preserved the inherent ordering. Applied strategic grouping for rare categories to reduce dimensionality while maintaining predictive power.
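The out-of-fold encoding idea looks roughly like this sketch (function and column names are illustrative, not the project's actual code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def cv_target_encode(train, col, target, n_splits=5, seed=0):
    """Out-of-fold target encoding: each row is encoded with category means
    computed on the *other* folds, so its own target value never leaks in."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    for tr_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(train):
        fold_means = train.iloc[tr_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = train.iloc[val_idx][col].map(fold_means).to_numpy()
    # Categories unseen in a training fold fall back to the global mean.
    return encoded.fillna(train[target].mean())

# Ordinal quality ratings keep their inherent ordering as integers.
QUALITY_MAP = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
```

At prediction time the encoding for new data is computed from the full training set, since leakage is only a concern during training.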
Many numerical features and the target variable (sale price) had highly skewed distributions that affected model performance.
Applied log and Yeo-Johnson transformations to normalize variables with skewed distributions, and Box-Cox transformations for strictly positive features, with a systematic approach to identify which variables benefited most from transformation. This normalization significantly improved the performance of linear models, which work best when features and residuals are approximately normally distributed.
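The selection logic might look like the following sketch, applied after missing values have been imputed. The 0.75 skew threshold is an assumption for illustration, not necessarily the project's value:

```python
import numpy as np
import pandas as pd
from scipy import stats

def normalize_skewed(df, threshold=0.75):
    """Transform numeric columns whose absolute skew exceeds the threshold:
    Box-Cox for strictly positive columns, Yeo-Johnson otherwise.
    Assumes missing values were already imputed upstream."""
    df = df.copy()
    for col in df.select_dtypes(include=[np.number]).columns:
        if abs(stats.skew(df[col])) > threshold:
            if (df[col] > 0).all():
                df[col], _ = stats.boxcox(df[col])   # requires positive values
            else:
                df[col], _ = stats.yeojohnson(df[col])  # handles zeros/negatives
    return df
```

The target itself is commonly handled the same way in this competition, e.g. training on `np.log1p(SalePrice)` and inverting with `np.expm1` at prediction time.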
With 79 original variables plus engineered features, selecting the most relevant predictors without overfitting was challenging.
Used Lasso regularization to identify the most predictive variables in an automated way. Implemented recursive feature elimination with cross-validation to determine the optimal feature subset. Combined statistical tests and domain knowledge to select features, resulting in a more interpretable model with comparable performance to models using all features.
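The Lasso-based selection step can be sketched like this, with a synthetic matrix standing in for the real engineered features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature matrix.
X, y = make_regression(n_samples=200, n_features=30, n_informative=8,
                       noise=10.0, random_state=0)

# LassoCV tunes the regularization strength by cross-validation; features
# whose coefficients shrink to exactly zero are dropped.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0)).fit(X, y)
selected = np.flatnonzero(model.named_steps["lassocv"].coef_)
print(f"kept {len(selected)} of {X.shape[1]} features")
```

Recursive feature elimination with cross-validation (`sklearn.feature_selection.RFECV`) follows the same pattern with an explicit search over subset sizes.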
Choosing from numerous regression algorithms with different strengths and weaknesses required systematic evaluation.
Systematically evaluated 11 different algorithms including linear models, tree-based methods, and gradient boosting techniques. Developed a standardized evaluation framework using k-fold cross-validation with multiple metrics. Created a consistent pipeline that allowed fair comparison between models, ultimately leading to an ensemble approach that combined the best-performing models.
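A stripped-down version of such a framework might look like this (four models instead of eleven, synthetic data in place of the real features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed training matrix.
X, y = make_regression(n_samples=300, n_features=20, noise=15.0, random_state=0)

# One fixed CV splitter: every candidate sees the identical folds,
# so the comparison is fair.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "ridge": make_pipeline(StandardScaler(), Ridge()),
    "lasso": make_pipeline(StandardScaler(), Lasso()),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gbm": GradientBoostingRegressor(random_state=0),
}
scores = {
    name: -cross_val_score(m, X, y, cv=cv,
                           scoring="neg_root_mean_squared_error").mean()
    for name, m in models.items()
}
for name, rmse in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: CV RMSE = {rmse:.2f}")
```

Wrapping the linear models in pipelines keeps scaling inside each fold, which avoids leaking validation statistics into training.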
I plan to enhance this project with several improvements in future iterations.
The project resulted in an ensemble model combining multiple optimized linear models, achieving competitive performance in the Kaggle competition. More importantly, the process demonstrated a complete machine learning workflow from exploratory data analysis through deployment, providing valuable practical experience in real-world data science.
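In scikit-learn terms, combining several tuned linear models can be as simple as a `VotingRegressor` that averages their predictions. This is a sketch of the idea, not the exact ensemble or hyperparameters used here:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed training matrix.
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# Averaging decorrelated linear models smooths out their individual errors.
ensemble = VotingRegressor([
    ("ridge", make_pipeline(StandardScaler(), Ridge(alpha=10.0))),
    ("lasso", make_pipeline(StandardScaler(), Lasso(alpha=0.5))),
    ("enet", make_pipeline(StandardScaler(), ElasticNet(alpha=0.5))),
])
score = cross_val_score(ensemble, X, y, cv=5, scoring="r2").mean()
print(f"ensemble CV R^2: {score:.3f}")
```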
A key outcome was the development of a reusable preprocessing pipeline for real estate data that can be applied to similar datasets in the future. The systematic approach to feature engineering revealed valuable insights about which techniques provided the most significant performance improvements, with transformations of skewed features and handling of categorical variables showing the highest impact.
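Packaged as a scikit-learn `ColumnTransformer`, such a reusable preprocessing pipeline might be sketched like this (the column lists and sample data are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_preprocessor(num_cols, cat_cols):
    """Fit once on training data, then reuse on any similarly shaped dataset."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="None")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, num_cols),
        ("cat", categorical, cat_cols),
    ])
```

Because imputation, scaling, and encoding are all fitted inside the transformer, applying it to new data is a single `transform` call with no risk of train/test inconsistency.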
The project included creating visualizations that compared model performance across different metrics, making it easier to understand the trade-offs between various algorithms and hyperparameter choices. These visual tools provide an effective way to communicate technical results to non-technical stakeholders.
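A minimal version of such a comparison chart is shown below; the scores are illustrative placeholders, not the project's actual results:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Illustrative placeholder scores, not the project's actual results.
cv_rmse = {"Ridge": 0.135, "Lasso": 0.131, "ElasticNet": 0.132, "Ensemble": 0.127}

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(cv_rmse.keys(), cv_rmse.values())
ax.set_ylabel("CV RMSE (log SalePrice)")
ax.set_title("Cross-validated error by model")
fig.tight_layout()
fig.savefig("model_comparison.png")
```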
Explore the technical details and implementation of this property price prediction model: