Outlier Winsorization Practice Problem
This data science coding problem lets you practice outlier detection and treatment, specifically winsorization, along with implementation skills. Read the problem statement, then write your solution.
- Problem ID: 162
- Problem key: 162-outlier-winsorization
- URL: https://datacrack.app/solve/162-outlier-winsorization
- Difficulty: medium
- Topic: Outlier Detection & Treatment
- Module: Data Cleaning
Problem Statement
# Outlier Winsorization
### 🎯 Goal
Cap extreme values in a numeric column at specified percentile bounds instead of removing them. This preserves the dataset size while reducing the impact of outliers.
### 💻 Task
Implement `winsorize_outliers(data, column, lower_percentile=5, upper_percentile=95)` that:
1. Converts the input dictionary to a DataFrame
2. Computes the lower and upper percentile values using `np.percentile()`
3. Clips the column values to these bounds using `np.clip()`
4. Rounds the result to 2 decimal places
5. Returns the modified DataFrame as a dictionary
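
Steps 2 to 4 can be sketched on a plain NumPy array before wiring them into the function. This is a minimal illustration, not the full solution: the sample values and the 25th/75th percentile bounds are assumptions chosen so the bounds land on existing data points.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
values = np.array([1, 2, 3, 4, 100])

# Step 2: percentile bounds (np.percentile interpolates linearly by default).
lower = np.percentile(values, 25)
upper = np.percentile(values, 75)

# Step 3: anything below `lower` becomes `lower`, anything above `upper` becomes `upper`.
clipped = np.clip(values, lower, upper)

# Step 4: round to 2 decimal places.
result = np.round(clipped, 2)

print(lower, upper)
print(result.tolist())
```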
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of numbers
- `column`: The name of the column to winsorize
- `lower_percentile`: Lower percentile for clipping (default: 5)
- `upper_percentile`: Upper percentile for clipping (default: 95)
### 📤 Output
- A dictionary representation of the DataFrame (using `orient='list'`)
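
The input and output formats mirror each other: a dictionary of lists goes in, and `DataFrame.to_dict(orient="list")` produces the same shape on the way out. A small round-trip sketch, using made-up column names and values:

```python
import pandas as pd

# Hypothetical input in the same shape as `data`: column name -> list of numbers.
data = {"values": [1.5, 2.5], "labels": [10, 20]}

df = pd.DataFrame(data)

# orient="list" maps each column name to a plain Python list of its values.
out = df.to_dict(orient="list")
print(out)
```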
---
### 🧩 Starter Code
```python
import pandas as pd
import numpy as np


def winsorize_outliers(data, column, lower_percentile=5, upper_percentile=95):
    """
    Cap outliers to percentile-based bounds (winsorization).

    Args:
        data (dict): Input data as dictionary (from JSON)
        column (str): Column name to winsorize
        lower_percentile (float): Lower percentile bound (default 5)
        upper_percentile (float): Upper percentile bound (default 95)

    Returns:
        dict: DataFrame as dictionary with winsorized values
    """
    # TODO: Convert input dictionary to a DataFrame
    # TODO: Compute lower and upper percentile values
    # TODO: Clip the column values to the bounds
    # TODO: Round results to 2 decimal places
    # TODO: Return DataFrame as dictionary
    pass
```
---
### 💡 Examples
**Example 1:** Cap extreme values at 10th/90th percentile
```python
data = {"values": [-100, 10, 20, 30, 40, 50, 60, 70, 80, 90, 200]}
winsorize_outliers(data, "values", lower_percentile=10, upper_percentile=90)
```
```
{"values": [10, 10, 20, 30, 40, 50, 60, 70, 80, 90, 90]}
```
**Example 2:** No extreme outliers, fractional bounds
```python
data = {"values": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]}
winsorize_outliers(data, "values", lower_percentile=5, upper_percentile=95)
```
```
{"values": [2.9, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 19.1]}
```
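
The fractional bounds in Example 2 (2.9 and 19.1) come from linear interpolation between neighboring sorted values, which is the default method of `np.percentile()`. The sketch below recomputes them by hand; `linear_percentile` is a hypothetical helper written for illustration and assumes the input is already sorted.

```python
import math

import numpy as np

values = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]  # Example 2 input, already sorted


def linear_percentile(sorted_vals, q):
    """Hypothetical helper: q-th percentile by linear interpolation."""
    # Fractional position of the percentile within the sorted list.
    pos = (q / 100) * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    # Interpolate between the two neighboring values.
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


lower = linear_percentile(values, 5)   # position 0.45, between 2 and 4
upper = linear_percentile(values, 95)  # position 8.55, between 18 and 20

print(round(lower, 2), round(upper, 2))
print(np.percentile(values, 5), np.percentile(values, 95))
```

With ten values, the 5th percentile sits at position 0.45 in the sorted list, so the lower bound is 2 + 0.45 × (4 − 2) = 2.9. The upper bound follows the same way: 18 + 0.55 × (20 − 18) = 19.1.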