Handle Unknown Categories Practice Problem
This data science coding problem gives you practice with categorical data cleaning: you implement a function that replaces unknown category values with a fallback. Read the problem statement, write your solution, and check it against the examples.
- Problem ID: 172
- Problem key: 172-handle-unknown-categories
- URL: https://datacrack.app/solve/172-handle-unknown-categories
- Difficulty: easy
- Topic: Categorical Data Cleaning
- Module: Data Cleaning
Problem Statement
# Handle Unknown Categories
### 🎯 Goal
When new or unexpected category values appear in production data (values that weren't present during training or aren't part of the valid set), they can break models or produce misleading results. Replacing unknown categories with a safe fallback value gives downstream steps a fixed, predictable set of categories to work with.
This function checks each value against a list of known valid categories and replaces anything not on the list with a configurable fill value.
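
The membership check described above can be sketched on a single pandas Series. This is only an illustration of one possible approach, not the required solution; the variable names (`colors`, `known`, `is_known`, `cleaned`) are made up for the example.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "green", "purple", "red"])
known = ["red", "blue", "green"]  # hypothetical list of valid categories

# Boolean mask: True where the value is on the known list
is_known = colors.isin(known)

# Keep values where the mask is True, substitute the fallback elsewhere
cleaned = colors.where(is_known, "Unknown")

print(cleaned.tolist())  # ['red', 'blue', 'green', 'Unknown', 'red']
```

`Series.where` keeps the original value wherever the condition holds, so only the values missing from `known` are overwritten.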
### 💻 Task
Implement `handle_unknown_categories(data, column, known_categories, fill_value="Unknown")` that:
1. Converts the input dictionary to a DataFrame
2. Replaces any value in the specified column that is **not** in `known_categories` with `fill_value`
3. Returns the cleaned DataFrame as a dictionary that maps each column name to a list of its values (the same shape as the input)
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists
- `column`: The column name to check
- `known_categories`: A list of valid category values
- `fill_value` *(optional, default `"Unknown"`)*: Replacement value for unknown categories
### 📤 Output
- A dictionary representing the cleaned DataFrame, with column names as keys and lists of values as values (as shown in the examples)
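
The examples in this problem show the output as column names mapped to plain lists. Assuming that is the expected shape, the sketch below shows how `DataFrame.to_dict(orient="list")` produces it, and how it differs from the default orientation, which nests values under the row index. The sample data and variable names are invented for illustration.

```python
import pandas as pd

data = {"size": ["S", "M", "L"], "qty": [1, 2, 3]}  # made-up sample input
df = pd.DataFrame(data)

# orient="list" yields {column: [values, ...]}, the same shape as the input
as_lists = df.to_dict(orient="list")
print(as_lists)   # {'size': ['S', 'M', 'L'], 'qty': [1, 2, 3]}

# The default orient ("dict") yields {column: {row_index: value}} instead
as_nested = df.to_dict()
print(as_nested["size"])  # {0: 'S', 1: 'M', 2: 'L'}
```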
---
### 🧩 Starter Code
```python
import pandas as pd


def handle_unknown_categories(data, column, known_categories, fill_value="Unknown"):
    """
    Replace categories not in the known list with a fill value.

    Args:
        data (dict): Input data as dictionary
        column (str): Column name to check
        known_categories (list): List of valid category values
        fill_value (str): Replacement value for unknown categories

    Returns:
        dict: Cleaned DataFrame as dictionary
    """
    # TODO: Convert input dictionary to DataFrame
    # TODO: Identify values not in known_categories
    # TODO: Replace unknown values with fill_value
    # TODO: Return DataFrame as dictionary
    pass
```
---
### 💡 Examples
**Example 1:** Unknown color
```python
data = {"color": ["red", "blue", "green", "purple", "red"]}
handle_unknown_categories(data, "color", ["red", "blue", "green"], "Unknown")
```
```
{"color": ["red", "blue", "green", "Unknown", "red"]}
```
**Example 2:** Custom fill value
```python
data = {"size": ["S", "M", "L", "XL", "XXL"]}
handle_unknown_categories(data, "size", ["S", "M", "L", "XL"], "Other")
```
```
{"size": ["S", "M", "L", "XL", "Other"]}
```
**Example 3:** Multiple unknowns
```python
data = {"status": ["active", "inactive", "pending", "deleted", "active"]}
handle_unknown_categories(data, "status", ["active", "inactive"], "Unknown")
```
```
{"status": ["active", "inactive", "Unknown", "Unknown", "active"]}
```
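
To check a solution against the three examples, one possible reference sketch is given below. It is one way to meet the task description, not the official answer: it assumes the output should use the column-to-list shape shown in the examples, and the `examples` list and loop are a hypothetical test harness.

```python
import pandas as pd


def handle_unknown_categories(data, column, known_categories, fill_value="Unknown"):
    """One possible solution sketch; assumes list-oriented dictionary output."""
    df = pd.DataFrame(data)
    # True where the value is a known category
    mask = df[column].isin(known_categories)
    # Keep known values, replace everything else with fill_value
    df[column] = df[column].where(mask, fill_value)
    return df.to_dict(orient="list")


# (input data, column, known categories, fill value, expected output)
examples = [
    ({"color": ["red", "blue", "green", "purple", "red"]}, "color",
     ["red", "blue", "green"], "Unknown",
     {"color": ["red", "blue", "green", "Unknown", "red"]}),
    ({"size": ["S", "M", "L", "XL", "XXL"]}, "size",
     ["S", "M", "L", "XL"], "Other",
     {"size": ["S", "M", "L", "XL", "Other"]}),
    ({"status": ["active", "inactive", "pending", "deleted", "active"]}, "status",
     ["active", "inactive"], "Unknown",
     {"status": ["active", "inactive", "Unknown", "Unknown", "active"]}),
]

results = []
for data, column, known, fill, expected in examples:
    result = handle_unknown_categories(data, column, known, fill)
    results.append(result == expected)

print(results)  # [True, True, True]
```

Because the function builds a new DataFrame from `data`, the caller's input dictionary is left unmodified.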