Group Rare Categories Practice Problem
This data science coding problem gives you practice with categorical data cleaning, specifically grouping rare categories into a single bucket. Read the problem statement, then write and test your own solution.
- Problem ID: 171
- Problem key: 171-group-rare-categories
- URL: https://datacrack.app/solve/171-group-rare-categories
- Difficulty: medium
- Topic: Categorical Data Cleaning
- Module: Data Cleaning
Problem Statement
# Group Rare Categories
### 🎯 Goal
Datasets often contain categorical columns with many low-frequency values that add noise without adding insight. Grouping these rare categories into a single `"Other"` bucket simplifies analysis and can improve the performance of downstream models.
This function identifies categories that appear too infrequently (below either an absolute count or a percentage threshold) and replaces them with `"Other"`.
### 💻 Task
Implement `group_rare_categories(data, column, min_count=None, min_percentage=None)` that:
1. Converts the input dictionary to a DataFrame
2. If `min_count` is provided, replaces categories appearing **fewer than** `min_count` times with `"Other"` (a category appearing exactly `min_count` times is kept)
3. If `min_percentage` is provided, replaces categories making up **less than** `min_percentage`% of total rows with `"Other"` (a category at exactly `min_percentage`% is kept)
4. Returns the cleaned DataFrame as a dictionary
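
The four steps above can be sketched as follows. This is one possible approach, not a reference solution. It rests on a few assumptions the problem statement does not settle: when both thresholds are passed, a category is treated as rare if it falls below either one; when neither is passed, the data is returned unchanged; and the result uses the column-to-list layout shown in the examples.

```python
import pandas as pd


def group_rare_categories(data, column, min_count=None, min_percentage=None):
    # Step 1: build a DataFrame from the column-name -> list mapping.
    df = pd.DataFrame(data)

    # Count how often each category occurs in the target column.
    counts = df[column].value_counts()

    # Collect rare categories. Assumption: if both thresholds are given,
    # failing either one is enough to be considered rare.
    rare = set()
    if min_count is not None:
        # Step 2: strictly fewer than min_count occurrences.
        rare.update(counts[counts < min_count].index)
    if min_percentage is not None:
        # Step 3: strictly less than min_percentage percent of all rows.
        percentages = counts / len(df) * 100
        rare.update(percentages[percentages < min_percentage].index)

    # Keep values that are not rare; replace the rest with "Other".
    keep = ~df[column].isin(list(rare))
    df[column] = df[column].where(keep, "Other")

    # Step 4: return column name -> list of values.
    return df.to_dict(orient="list")
```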
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists
- `column`: The column name to process
- `min_count` *(optional)*: Minimum number of occurrences to keep a category
- `min_percentage` *(optional)*: Minimum percentage of total rows to keep a category
### 📤 Output
- A dictionary representing the cleaned DataFrame, mapping each column name to a list of values (the same layout as the input)
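
The examples in this problem show the output as a mapping from column name to a list of values. As a small illustration, assuming that layout is what is expected, note that pandas' `DataFrame.to_dict()` defaults to a nested index-keyed layout, while `orient="list"` produces the flat one:

```python
import pandas as pd

df = pd.DataFrame({"dept": ["HR", "IT", "Other"]})

# Default orient ("dict") nests values under their row index.
nested = df.to_dict()

# orient="list" gives column name -> list, matching the examples' layout.
as_lists = df.to_dict(orient="list")

print(nested)
print(as_lists)
```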
---
### 🧩 Starter Code
```python
import pandas as pd


def group_rare_categories(data, column, min_count=None, min_percentage=None):
    """
    Group rare categories into 'Other' based on count or percentage threshold.

    Args:
        data (dict): Input data as dictionary
        column (str): Column name to process
        min_count (int, optional): Minimum count threshold
        min_percentage (float, optional): Minimum percentage threshold

    Returns:
        dict: Cleaned DataFrame as dictionary
    """
    # TODO: Convert input dictionary to DataFrame
    # TODO: Calculate value counts for the column
    # TODO: Identify rare categories based on min_count or min_percentage
    # TODO: Replace rare categories with 'Other'
    # TODO: Return DataFrame as dictionary
    pass
```
---
### 💡 Examples
**Example 1:** Count-based grouping
```python
data = {"color": ["red", "red", "red", "blue", "blue", "green", "yellow", "purple"]}
group_rare_categories(data, "color", min_count=2)
```
```
{"color": ["red", "red", "red", "blue", "blue", "Other", "Other", "Other"]}
```
**Example 2:** Percentage-based grouping
```python
data = {"fruit": ["apple", "apple", "apple", "apple", "banana", "cherry", "date", "elderberry", "fig", "grape"]}
group_rare_categories(data, "fruit", min_percentage=15)
```
```
{"fruit": ["apple", "apple", "apple", "apple", "Other", "Other", "Other", "Other", "Other", "Other"]}
```
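
To see why only `apple` survives in Example 2, it may help to compute each category's share of the rows. The snippet below is an illustrative check rather than part of the required solution. It uses `value_counts(normalize=True)`, which returns fractions that are then scaled to percentages: `apple` covers 4 of 10 rows (40%), while every other fruit covers 1 of 10 (10%), which is below the 15% threshold.

```python
import pandas as pd

fruit = pd.Series(
    ["apple", "apple", "apple", "apple", "banana",
     "cherry", "date", "elderberry", "fig", "grape"]
)

# Share of total rows per category, expressed as a percentage.
percentages = fruit.value_counts(normalize=True) * 100

# Categories strictly below the 15% threshold are the ones to group.
rare = sorted(percentages[percentages < 15].index)

print(percentages)
print(rare)
```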
**Example 3:** Department grouping
```python
data = {"dept": ["HR", "HR", "HR", "IT", "IT", "Finance", "Legal"]}
group_rare_categories(data, "dept", min_count=2)
```
```
{"dept": ["HR", "HR", "HR", "IT", "IT", "Other", "Other"]}
```