Multivariate House Price Prediction Practice Problem
This data science coding problem helps you practice Linear Regression, multivariate house price prediction, and implementation skills. Read the problem statement, write your solution, and strengthen your understanding of Linear Regression.
- Problem ID: 6
- Problem key: 6-multivariate-house-price-prediction
- URL: https://datacrack.app/solve/6-multivariate-house-price-prediction
- Difficulty: easy
- Topic: Linear Regression
- Module: Introduction to Machine Learning
Problem Statement
# Multivariate House Price Prediction
---
### 🎯 Goal
In this problem, you’ll extend your linear regression knowledge to handle **multiple features**.
You’ll predict **house prices** using the **California Housing dataset**, this time with **several input variables** at once.
You’ll learn how to:
- Use **multiple features** (columns) as inputs to linear regression
- Train a **multivariate LinearRegression** model using **scikit-learn**
- Understand how the model combines several variables to make predictions
---
### 📊 Dataset Description
We use the **California Housing dataset** from `sklearn.datasets`, which contains real data about housing districts in California.
This problem uses five of the dataset’s eight features:
| Column | Description |
|:-------|:-------------|
| **MedInc** | Median income in the area (in tens of thousands of dollars) |
| **AveRooms** | Average number of rooms per household |
| **AveOccup** | Average number of household members |
| **HouseAge** | Median age of houses in the district |
| **Population** | Total population of the district |
The target variable is:
| Column | Description |
|:-------|:-------------|
| **MedHouseVal** | Median house value (in hundreds of thousands of dollars) |
---
### 📥 Input / 📤 Output
- **Input:** `X_test`, a pandas DataFrame containing the columns `['MedInc', 'AveRooms', 'AveOccup', 'HouseAge', 'Population']`
- **Output:** `y_pred`, the predicted house prices (NumPy array or pandas Series)
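Because the model is trained on named columns, `X_test` should supply those same five columns in the training order; recent scikit-learn versions check feature names at prediction time and may warn or raise on a mismatch. A minimal sketch of a defensive check, assuming a hypothetical helper named `align_features` (it is not part of the problem or of scikit-learn):

```python
import pandas as pd

FEATURE_COLS = ['MedInc', 'AveRooms', 'AveOccup', 'HouseAge', 'Population']

def align_features(X_test, feature_cols=FEATURE_COLS):
    """Hypothetical helper: return X_test with the expected columns, in order."""
    X_test = pd.DataFrame(X_test)  # also accepts a dict of equal-length lists
    missing = [c for c in feature_cols if c not in X_test.columns]
    if missing:
        raise ValueError(f"X_test is missing columns: {missing}")
    return X_test[feature_cols]

# Columns supplied in a different order are put back into the training order.
shuffled = pd.DataFrame({
    'Population': [1200, 500],
    'MedInc': [3.0, 5.0],
    'HouseAge': [25, 40],
    'AveRooms': [5.0, 6.5],
    'AveOccup': [3.0, 2.0],
})
aligned = align_features(shuffled)
print(list(aligned.columns))
```

Whether to add such a check inside `train_multivariate_model` is a design choice; the problem statement itself only requires that predictions be returned.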
---
### 💻 Task
Implement a function `train_multivariate_model(X_test)` that:
1. **Loads** the California housing dataset using `fetch_california_housing()`.
2. **Selects** the five features listed above and the target variable `'MedHouseVal'`.
3. **Trains** a linear regression model using **`sklearn.linear_model.LinearRegression`**.
4. **Predicts** house prices for the provided test data `X_test`.
5. **Returns** the predictions only.
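The select, train, and predict steps above follow a common scikit-learn pattern. A minimal sketch on made-up toy data rather than the housing dataset (the column names `a`, `b`, and `y` are hypothetical and chosen only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (made up): the target is an exact linear function of two
# features, y = 1 + 2*a + 3*b, so the fit should recover those weights.
train = pd.DataFrame({'a': [0.0, 1.0, 0.0, 1.0, 2.0],
                      'b': [0.0, 0.0, 1.0, 1.0, 1.0]})
train['y'] = 1 + 2 * train['a'] + 3 * train['b']

# Select several columns at once by passing a list of names.
feature_cols = ['a', 'b']
model = LinearRegression()
model.fit(train[feature_cols], train['y'])

# Predict for unseen rows that have the same feature columns.
new_rows = pd.DataFrame({'a': [3.0], 'b': [2.0]})
y_pred = model.predict(new_rows[feature_cols])
print(y_pred)  # close to [13.], since 1 + 2*3 + 3*2 = 13
```

The same three calls (`LinearRegression()`, `fit`, `predict`) apply unchanged when the feature list has five columns instead of two.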
---
### 🧩 Starter Code
```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing


def train_multivariate_model(X_test):
    # Step 1: Load California Housing dataset
    data = fetch_california_housing(as_frame=True)
    df = data.frame

    # Step 2: Select multiple features and target
    feature_cols = ['MedInc', 'AveRooms', 'AveOccup', 'HouseAge', 'Population']
    X_train = df[feature_cols]
    y_train = df['MedHouseVal']

    # TODO: Train and predict
    # 1. Initialize LinearRegression()
    # 2. Fit the model
    # 3. Predict y_pred on X_test
    # 4. Return y_pred only
    pass
```
---
### 💡 Example + Expected Output
```python
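import pandas as pd

# X_test must be a DataFrame (not a plain dict) with the five feature columns
X_test = pd.DataFrame({
    'MedInc': [3.0, 5.0],
    'AveRooms': [5.0, 6.5],
    'AveOccup': [3.0, 2.0],
    'HouseAge': [25, 40],
    'Population': [1200, 500]
})
y_pred = train_multivariate_model(X_test)
print(y_pred.round(2))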
```
**Expected Output (example):**
```
[1.62 2.72]
```
---
### 🧠 Hint
In multivariate regression, each feature contributes its own weight:
$$
\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n
$$
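Here $w_0$ is the intercept and $w_1, \dots, w_n$ are the feature weights (in this problem, $n = 5$). `LinearRegression` fits all the weights jointly by ordinary least squares: it chooses the values that minimize the sum of squared differences between predicted and actual prices on the training data.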
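The weighted sum in the formula above can be checked numerically: a fitted `LinearRegression` exposes $w_0$ as `intercept_` and $w_1, \dots, w_n$ as `coef_`. A minimal sketch on synthetic data (the random features and the weights in `true_w` are made up for illustration and have nothing to do with the housing dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 200 rows, 5 features, a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([0.5, -1.0, 2.0, 0.0, 1.5])  # made-up weights
y = 0.7 + X @ true_w + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Recompute predictions by hand: y_hat = w_0 + w_1*x_1 + ... + w_n*x_n
x_new = rng.normal(size=(3, 5))
manual = model.intercept_ + x_new @ model.coef_

print(np.allclose(manual, model.predict(x_new)))  # True
```

Inspecting `model.coef_` after training on the housing features is one way to see how much each variable contributes per unit change, keeping in mind that the features are on very different scales.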