Fill Missing Values with Mean, Median, Mode Practice Problem
This data science coding problem lets you practice missing data handling: filling missing values with the mean, median, or mode. Read the problem statement, write your solution, and strengthen your understanding of missing data handling.
- Problem ID: 24
- Problem key: 24-fill-missing-values-with-mean-median-mode
- URL: https://datacrack.app/solve/24-fill-missing-values-with-mean-median-mode
- Difficulty: easy
- Topic: Missing Data Handling
- Module: Data Cleaning
Problem Statement
# 🧩 Fill Missing Values with Mean, Median, or Mode
---
### 🎯 Goal
One of the most common techniques for handling missing data is **imputation**: replacing missing values with estimated ones.
The three most popular statistical imputation methods are:
- **Mean**: Average of non-missing values (for numerical data)
- **Median**: Middle value when sorted (for numerical data, robust to outliers)
- **Mode**: Most frequent value (for categorical data)
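
As a quick illustration of the three statistics listed above, here is a minimal sketch that computes each one on a small pandas Series. The values are made up for this example; pandas skips `NaN` by default when computing them.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only for illustration
s = pd.Series([2.0, 4.0, 4.0, np.nan, 10.0])

mean_value = s.mean()          # (2 + 4 + 4 + 10) / 4 = 5.0, NaN is skipped
median_value = s.median()      # middle of [2, 4, 4, 10] = 4.0
mode_value = s.mode().iloc[0]  # most frequent value = 4.0

print(mean_value, median_value, mode_value)
```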
---
### 🔍 When to Use Each Strategy?
| Strategy | Best For | Pros | Cons |
|:--------:|:---------|:-----|:-----|
| **Mean** | Roughly symmetric (e.g. normally distributed) numerical data | Simple, preserves the column mean | Sensitive to outliers |
| **Median** | Skewed numerical data or data with outliers | Robust to outliers | Distorts the distribution by piling filled values at one point |
| **Mode** | Categorical data | Works on non-numeric values | Over-represents the most frequent category |
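
To make the outlier trade-off in the table concrete, the sketch below compares the two fill values on a hypothetical column with one extreme entry. The numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme outlier and one missing value
income = pd.Series([30.0, 32.0, 35.0, 1000.0, np.nan])

mean_fill = income.mean()      # (30 + 32 + 35 + 1000) / 4 = 274.25
median_fill = income.median()  # (32 + 35) / 2 = 33.5

filled_with_mean = income.fillna(mean_fill)
filled_with_median = income.fillna(median_fill)

print(filled_with_mean.iloc[-1], filled_with_median.iloc[-1])
```

Here the mean is pulled far above the typical values by the single outlier, while the median stays close to them.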
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of data (this is the format from JSON)
  - Example: `{"A": [1, 2, null, 4], "B": [10, null, 30, 40]}`
  - **Note**: You must convert this to a pandas DataFrame using `pd.DataFrame(data)` before processing
- `strategy`: String indicating the imputation method (`'mean'`, `'median'`, or `'mode'`)
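
The conversion step can be sketched as follows, assuming the JSON `null` values have already been parsed into Python `None` (which is what `json.loads` produces). pandas treats `None` in a numeric list as a missing value.

```python
import pandas as pd

# JSON null arrives in Python as None
data = {"A": [1, 2, None, 4], "B": [10, None, 30, 40]}
df = pd.DataFrame(data)

missing_per_column = df.isna().sum()  # one missing value in each column
print(df)
print(missing_per_column)
```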
### 📤 Output
- A pandas DataFrame with missing values filled using the specified strategy
---
### 💻 Task
Implement a Python function `fill_missing_values(data, strategy)` that:
1. Converts `data` to a pandas DataFrame
2. Checks the imputation strategy
3. Fills missing values in each column using the corresponding statistic
4. Returns the filled DataFrame
---
### 🧩 Starter Code
```python
import pandas as pd
import numpy as np


def fill_missing_values(data, strategy='mean'):
    """
    Fill missing values using mean, median, or mode.

    Args:
        data (dict or pd.DataFrame): Input data as dictionary (from JSON) or DataFrame
        strategy (str): 'mean', 'median', or 'mode'

    Returns:
        pd.DataFrame: DataFrame with missing values filled
    """
    # 🧠 TODO: Convert the input dictionary to a DataFrame using pd.DataFrame(data)
    # 🧠 TODO: Create a copy of the DataFrame to avoid modifying the original
    # 🧠 TODO: Use df.fillna() with df.mean(), df.median(), or df.mode()
    # 🧠 TODO: For mode, use df.mode().iloc[0] to get the first mode if multiple exist
    pass
```
---
### 💡 Example 1: Mean Imputation
```python
df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0, 5.0],
    'B': [10.0, np.nan, 30.0, 40.0, 50.0]
})
fill_missing_values(df, strategy='mean')
```
#### Expected Output
```python
     A     B
0  1.0  10.0
1  2.0  32.5  # Filled with mean of [10, 30, 40, 50] = 32.5
2  3.0  30.0  # Filled with mean of [1, 2, 4, 5] = 3.0
3  4.0  40.0
4  5.0  50.0
```
---
### 💡 Example 2: Median Imputation
```python
df = pd.DataFrame({
    'X': [1.0, 5.0, np.nan, 3.0, 7.0],
    'Y': [2.0, 4.0, np.nan, 8.0, 10.0]
})
fill_missing_values(df, strategy='median')
```
#### Expected Output
```python
     X     Y
0  1.0   2.0
1  5.0   4.0
2  4.0   6.0  # X: median of [1, 3, 5, 7] = 4.0; Y: median of [2, 4, 8, 10] = 6.0
3  3.0   8.0
4  7.0  10.0
```
---
### 💡 Example 3: Mode Imputation
```python
df = pd.DataFrame({
    'category': ['A', 'B', 'A', np.nan, 'A', 'B']
})
fill_missing_values(df, strategy='mode')
```
#### Expected Output
```python
  category
0        A
1        B
2        A
3        A  # Filled with mode 'A' (appears 3 times)
4        A
5        B
```
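
When two values tie for most frequent, `df.mode()` returns one row per tied value, sorted, which is why the starter code suggests `.iloc[0]`. A minimal sketch of that behaviour, using a hypothetical column where `'A'` and `'B'` each appear twice:

```python
import numpy as np
import pandas as pd

# Hypothetical column with a tie between 'A' and 'B'
df = pd.DataFrame({"category": ["B", "A", "B", "A", np.nan]})

modes = df.mode()           # one row per tied mode, sorted: 'A' then 'B'
first_mode = modes.iloc[0]  # Series holding the first mode of each column

filled = df.fillna(first_mode)
print(filled)
```

Taking the first row is a convention, not a statistically preferred choice; with a tie, any of the tied values is an equally valid mode.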
---
### 🔑 Key Pandas Functions
- `df.fillna(value)`: Fill missing values with a specified value or Series
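- `df.mean()`: Compute the mean of each column (pass `numeric_only=True` if the DataFrame also contains non-numeric columns)
- `df.median()`: Compute the median of each column (pass `numeric_only=True` if the DataFrame also contains non-numeric columns)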
- `df.mode()`: Compute mode for each column (returns a DataFrame)
- `df.copy()`: Create a copy of the DataFrame to avoid in-place modifications
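
The functions above can be combined as in the sketch below. It is one possible arrangement under stated assumptions, not the reference solution: the column names are hypothetical, and falling back to the mode for the non-numeric column is a choice made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing a numeric and a categorical column
df = pd.DataFrame({
    "score": [1.0, np.nan, 3.0],
    "grade": ["x", "x", np.nan],
})

filled = df.copy()                      # leave the original untouched
means = filled.mean(numeric_only=True)  # Series: score -> 2.0
filled = filled.fillna(means)           # fills only the columns present in `means`

# Assumption for this sketch: remaining (non-numeric) gaps get the column mode
filled = filled.fillna(filled.mode().iloc[0])
print(filled)
```

Passing a Series to `fillna` matches its index against the column names, so each column receives its own fill value.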
---