Remove Rows with Missing Data Practice Problem
This data science coding problem gives you practice with missing data handling: deciding which rows to remove when values are missing, and implementing that logic yourself. Read the problem statement, write your solution, and strengthen your understanding of missing data handling.
- Problem ID: 27
- Problem key: 27-remove-rows-with-missing-data
- URL: https://datacrack.app/solve/27-remove-rows-with-missing-data
- Difficulty: easy
- Topic: Missing Data Handling
- Module: Data Cleaning
Problem Statement
# 🧩 Remove Rows with Missing Data
---
### 🎯 Goal
Sometimes the best way to handle missing data is to **remove** the rows that contain it.
This is appropriate when:
- Missing data is minimal (< 5% of rows)
- Imputation would introduce too much bias
- You have enough data to afford losing some rows
Pandas provides flexible options for dropping rows based on missing values.
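
Before dropping anything, it can help to measure how many rows would actually be lost. The snippet below is a minimal sketch on made-up sample data (the DataFrame and the variable names are illustrative assumptions, not part of the problem); it computes the fraction of rows that contain at least one missing value, which is what `how='any'` would remove.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, 7, 8],
    "C": [9, 10, 11, 12],
})

# True for each row that contains at least one missing value.
rows_with_nan = df.isna().any(axis=1)

# The mean of a boolean Series is the fraction of True entries,
# i.e. the share of rows that how='any' would remove.
missing_row_fraction = rows_with_nan.mean()
print(f"{missing_row_fraction:.0%} of rows contain a missing value")  # 50%
```

Here half of the rows would be lost, far above the 5% rule of thumb, so dropping would probably not be the first choice for this data.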
---
### 🔍 Dropping Strategies
| Parameter | Behavior | Example |
|:----------|:---------|:--------|
| `how='any'` | Drop row if **any** column has NaN | `[1, NaN, 3]` → dropped |
| `how='all'` | Drop row only if **all** columns are NaN | `[NaN, NaN, NaN]` → dropped |
| `thresh=n` | Drop row if it has **fewer than n** non-null values | `thresh=2` → keep rows with ≥2 valid values |
| `subset=['A', 'B']` | Only consider specific columns for dropping | Ignore NaNs in other columns |
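
To see how the four options in the table differ, the sketch below applies each one to the same small DataFrame and records which original row labels survive. The data is invented for illustration; only `DataFrame.dropna` and its `how`, `thresh`, and `subset` parameters come from pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 has one valid value,
# row 2 is entirely NaN, row 3 has two valid values.
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, 4],
    "B": [2, 5, np.nan, np.nan],
    "C": [3, np.nan, np.nan, 6],
})

any_kept = df.dropna(how="any").index.tolist()        # [0]
all_kept = df.dropna(how="all").index.tolist()        # [0, 1, 3]
thresh_kept = df.dropna(thresh=2).index.tolist()      # [0, 3]
subset_kept = df.dropna(subset=["B"]).index.tolist()  # [0, 1]

print(any_kept, all_kept, thresh_kept, subset_kept)
```

Note that `dropna` keeps the original row labels, which is why the surviving labels can be compared directly here.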
---
### 📥 Input
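- `df`: Input data with missing values. The examples pass a pandas DataFrame; the starter code names this parameter `data` and documents it as a dictionary to be converted with `pd.DataFrame(data)`
- `how`: String (`'any'` or `'all'`) or `None` if using `thresh`
- `thresh`: Integer, the minimum number of non-null values a row needs in order to be kept (optional)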
- `subset`: List of column names to check (optional)
### 📤 Output
- A pandas DataFrame with rows removed based on the criteria
---
### 💻 Task
Implement a Python function `remove_missing_rows(df, how='any', thresh=None, subset=None)` that:
1. Validates the parameters
2. Drops rows according to the specified criteria
3. Returns the cleaned DataFrame
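
Step 1 does not say which checks are expected. The sketch below shows one plausible reading, under stated assumptions: `how` must be `'any'` or `'all'` when no threshold is given, `thresh` must be a non-negative integer, and every name in `subset` must be an existing column. The helper name `validate_dropna_args` and these specific rules are hypothetical and not part of the problem statement.

```python
import numbers

import pandas as pd


def validate_dropna_args(df, how="any", thresh=None, subset=None):
    """Hypothetical helper: raise if the arguments look unusable."""
    if thresh is None:
        # `how` only matters when no threshold is given.
        if how not in ("any", "all"):
            raise ValueError(f"how must be 'any' or 'all', got {how!r}")
    else:
        # Assumed rule: thresh is a non-negative integer (bool is rejected).
        is_int = isinstance(thresh, numbers.Integral) and not isinstance(thresh, bool)
        if not is_int or thresh < 0:
            raise ValueError(f"thresh must be a non-negative integer, got {thresh!r}")
    if subset is not None:
        # Every column named in `subset` must exist in the DataFrame.
        unknown = [col for col in subset if col not in df.columns]
        if unknown:
            raise KeyError(f"subset columns not found: {unknown}")


df = pd.DataFrame({"A": [1.0, None], "B": [None, 2.0]})
validate_dropna_args(df, how="all", subset=["A"])  # passes silently
```

Raising early with a specific message is a design choice; relying on the errors that pandas itself raises would also be a reasonable approach.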
---
### 🧩 Starter Code
```python
import pandas as pd
import numpy as np


def remove_missing_rows(data, how='any', thresh=None, subset=None):
    """
    Remove rows with missing data based on specified criteria.

    Args:
        data (dict or pd.DataFrame): Input data (a dictionary from JSON,
            or a DataFrame as in the examples below)
        how (str): 'any' or 'all' (ignored if thresh is specified)
        thresh (int): Minimum number of non-null values required
        subset (list): Column names to consider

    Returns:
        pd.DataFrame: DataFrame with rows removed
    """
    # 🧠 TODO: Convert the input dictionary to a DataFrame using pd.DataFrame(data)
    # 🧠 TODO: Use df.dropna() with appropriate parameters
    # 🧠 TODO: Handle the thresh vs how parameter logic
    # 🧠 TODO: Reset index after dropping rows
    pass
```
---
### 💡 Example 1: Drop Any Row with NaN (`how='any'`)
```python
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8],
'C': [9, 10, 11, 12]
})
remove_missing_rows(df, how='any')
```
#### Expected Output
```python
     A    B   C
0  1.0  5.0   9   # Original row 0: no NaNs ✓
1  4.0  8.0  12   # Original row 3: no NaNs ✓
# Rows 1 and 2 dropped (each contains a NaN); A and B print as floats because they held NaN
```
---
### 💡 Example 2: Drop Only Rows Where All Values are NaN (`how='all'`)
```python
df = pd.DataFrame({
'X': [np.nan, np.nan, 3],
'Y': [1, np.nan, 3],
'Z': [1, np.nan, 3]
})
remove_missing_rows(df, how='all')
```
#### Expected Output
```python
     X    Y    Z
0  NaN  1.0  1.0   # Original row 0: has some valid values ✓
1  3.0  3.0  3.0   # Original row 2: all valid ✓
# Row 1 dropped (X, Y, and Z are all NaN)
```
---
### 💡 Example 3: Threshold-Based Dropping (`thresh`)
```python
df = pd.DataFrame({
'A': [1, np.nan, np.nan, 4, 5],
'B': [np.nan, 2, np.nan, 4, 5],
'C': [1, np.nan, 3, 4, 5]
})
remove_missing_rows(df, thresh=2) # Keep rows with ≥2 non-null values
```
#### Expected Output
```python
     A    B    C
0  1.0  NaN  1.0   # Original row 0: 2 non-null (A, C) ✓
1  4.0  4.0  4.0   # Original row 3: 3 non-null ✓
2  5.0  5.0  5.0   # Original row 4: 3 non-null ✓
# Original rows 1 and 2 dropped (1 non-null value each)
```
---
### 💡 Example 4: Subset-Based Dropping
```python
df = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', np.nan, 'David'],
'Score': [85, np.nan, 90, 88]
})
# Only drop if Name OR Score is missing
remove_missing_rows(df, how='any', subset=['Name', 'Score'])
```
#### Expected Output
```python
   ID   Name  Score
0   1  Alice   85.0   # No NaNs in Name/Score ✓
1   4  David   88.0   # No NaNs in Name/Score ✓
# Rows 1 and 2 dropped (NaN in subset columns)
```
---
### 🔑 Key Pandas Functions
- `df.dropna()`: Remove rows/columns with missing values
  - `how='any'`: Drop if any value is NaN
  - `how='all'`: Drop only if all values are NaN
  - `thresh=n`: Drop if fewer than n non-null values
  - `subset=[...]`: Consider only specified columns
- `df.reset_index(drop=True)`: Reset row indices after dropping
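
The functions above could be combined along the following lines. This is a sketch of one possible solution, not the reference answer: it skips parameter validation, and it assumes that `thresh` should take precedence over `how` whenever it is given. Passing only one of the two to `dropna` also sidesteps the error that recent pandas versions are understood to raise when both are set in the same call.

```python
import numpy as np
import pandas as pd


def remove_missing_rows(data, how="any", thresh=None, subset=None):
    """One possible implementation; validation is intentionally omitted."""
    # Accepts a dict of columns or an existing DataFrame.
    df = pd.DataFrame(data)

    if thresh is not None:
        # Threshold mode: keep rows with at least `thresh` non-null values.
        cleaned = df.dropna(thresh=thresh, subset=subset)
    else:
        # 'any' / 'all' mode.
        cleaned = df.dropna(how=how, subset=subset)

    # Renumber rows 0..n-1 so the dropped labels leave no gaps.
    return cleaned.reset_index(drop=True)


# Data from Example 3.
result = remove_missing_rows(
    {
        "A": [1, np.nan, np.nan, 4, 5],
        "B": [np.nan, 2, np.nan, 4, 5],
        "C": [1, np.nan, 3, 4, 5],
    },
    thresh=2,
)
print(result)
```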
---
### ⚠️ Important Considerations
**1. Data Loss**
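- Dropping rows reduces dataset size and, with it, the statistical power of any later analysis. With `how='any'`, a single sparsely populated column can remove most of the dataset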
**2. Bias Introduction**
- If missingness is not random (e.g., high earners don't report salary), dropping introduces **selection bias**
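
The salary example can be made concrete with a small, fully invented dataset. In the sketch below, the two highest earners are assumed not to report their salary; dropping those rows then halves the estimated mean. The numbers and the 100,000 cutoff are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical true salaries for five people.
true_salary = pd.Series([40_000, 50_000, 60_000, 150_000, 200_000], dtype="float64")

# Simulate non-random missingness: values of 100,000 or more become NaN.
reported = true_salary.where(true_salary < 100_000)

true_mean = true_salary.mean()           # 100000.0
dropped_mean = reported.dropna().mean()  # 50000.0

print(f"true mean: {true_mean:.0f}, mean after dropping: {dropped_mean:.0f}")
```

The remaining rows are all valid, yet the estimate is badly biased because the missingness depends on the value itself.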
**3. When to Drop vs. Impute**
| Scenario | Recommendation |
|:---------|:---------------|
| < 5% missing | Usually safe to drop |
| 5-20% missing | Consider imputation |
| > 20% missing | Investigate pattern, likely impute |
| Missing Not At Random | Advanced methods or domain expertise |
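
The percentage thresholds in this table can be applied per column as a first-pass triage. The sketch below is an assumption-laden illustration: the function name `suggest_strategy`, the sample columns, and the choice to treat exactly 20% as "consider imputation" are all hypothetical. It cannot detect the last row of the table, since whether data is Missing Not At Random cannot be read off the missing fraction alone.

```python
import numpy as np
import pandas as pd


def suggest_strategy(missing_fraction):
    """Map a column's missing fraction to the table's rule of thumb."""
    if missing_fraction < 0.05:
        return "usually safe to drop"
    if missing_fraction <= 0.20:
        return "consider imputation"
    return "investigate pattern, likely impute"


# Hypothetical 40-row dataset with 2.5%, 10%, and 50% missing values.
df = pd.DataFrame({
    "age": [30.0] * 39 + [np.nan],
    "income": [1.0] * 36 + [np.nan] * 4,
    "notes": [1.0] * 20 + [np.nan] * 20,
})

# isna().mean() gives the fraction of missing values in each column.
missing_fraction = df.isna().mean()
report = pd.DataFrame({
    "missing_fraction": missing_fraction,
    "suggestion": missing_fraction.map(suggest_strategy),
})
print(report)
```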
---