Group Rare Categories Practice Problem

This data science coding problem gives you practice with categorical data cleaning, specifically grouping rare categories into a single bucket. Read the problem statement, then write and test your solution.

Problem ID: 171
Problem key: 171-group-rare-categories
URL: https://datacrack.app/solve/171-group-rare-categories
Difficulty: medium
Topic: Categorical Data Cleaning
Module: Data Cleaning

Problem Statement

# Group Rare Categories

### 🎯 Goal
Datasets often contain categorical columns with many low-frequency values that add noise without adding insight. Grouping these rare categories into a single `"Other"` bucket simplifies analysis and can improve the performance of downstream models.

The function identifies categories that appear too infrequently, measured either against an absolute count or against a percentage of total rows, and replaces them with `"Other"`.

### 💻 Task
Implement `group_rare_categories(data, column, min_count=None, min_percentage=None)` that:
1. Converts the input dictionary to a DataFrame
2. If `min_count` is provided, replaces categories appearing **fewer than** `min_count` times with `"Other"`
3. If `min_percentage` is provided, replaces categories making up **less than** `min_percentage`% of total rows with `"Other"`
4. Returns the cleaned DataFrame as a dictionary of lists, in the same shape as the input
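
The frequency lookup behind steps 2 and 3 can be sketched with pandas `Series.value_counts`. This is only an illustration of that one step, not the full solution, and the sample column is made up for the purpose.

```python
import pandas as pd

# Hypothetical sample column, for illustration only.
df = pd.DataFrame({"color": ["red", "red", "red", "blue", "blue", "green"]})

# Absolute number of rows per category (sorted from most to least frequent).
counts = df["color"].value_counts()

# Share of total rows per category, expressed as a percentage.
percentages = counts * 100 / len(df)

print(counts.to_dict())
print(percentages.round(1).to_dict())
```

Comparing `counts` against `min_count`, or `percentages` against `min_percentage`, then yields the set of categories to replace.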

---

### 📥 Input
- `data`: A dictionary where keys are column names and values are lists
- `column`: The column name to process
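- `min_count` *(optional)*: Minimum number of occurrences required to keep a category; a category with exactly `min_count` occurrences is kept
- `min_percentage` *(optional)*: Minimum percentage of total rows required to keep a category, given on a 0 to 100 scale (for example, `15` means 15%)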

### 📤 Output
- A dictionary representing the cleaned DataFrame, with column names as keys and lists of values
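
As a small sketch of the input and output shape, the round trip below converts a dictionary of lists to a DataFrame and back. The sample data is hypothetical. Note that `DataFrame.to_dict()` without arguments returns nested `{column: {index: value}}` dictionaries, so `orient="list"` is what matches the format shown in the examples.

```python
import pandas as pd

# Hypothetical input in the same shape as the problem's `data` argument.
data = {"dept": ["HR", "IT"], "headcount": [3, 2]}

df = pd.DataFrame(data)

# orient="list" maps each column name to a plain list of its values.
result = df.to_dict(orient="list")
print(result)
```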

---

### 🧩 Starter Code

```python
import pandas as pd

def group_rare_categories(data, column, min_count=None, min_percentage=None):
    """
    Group rare categories into 'Other' based on count or percentage threshold.

    Args:
        data (dict): Input data as dictionary
        column (str): Column name to process
        min_count (int, optional): Minimum count threshold
        min_percentage (float, optional): Minimum percentage threshold

    Returns:
        dict: Cleaned DataFrame as dictionary
    """
    # TODO: Convert input dictionary to DataFrame
    # TODO: Calculate value counts for the column
    # TODO: Identify rare categories based on min_count or min_percentage
    # TODO: Replace rare categories with 'Other'
    # TODO: Return DataFrame as dictionary
    pass
```

---

### 💡 Examples

**Example 1:** Count-based grouping
```python
data = {"color": ["red", "red", "red", "blue", "blue", "green", "yellow", "purple"]}
group_rare_categories(data, "color", min_count=2)
```
```
{"color": ["red", "red", "red", "blue", "blue", "Other", "Other", "Other"]}
```

**Example 2:** Percentage-based grouping
```python
data = {"fruit": ["apple", "apple", "apple", "apple", "banana", "cherry", "date", "elderberry", "fig", "grape"]}
group_rare_categories(data, "fruit", min_percentage=15)
```
```
{"fruit": ["apple", "apple", "apple", "apple", "Other", "Other", "Other", "Other", "Other", "Other"]}
```
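
The arithmetic behind Example 2 can be checked directly: `apple` covers 4 of 10 rows (40%), while every other fruit covers 1 of 10 (10%), which is below the 15% threshold. The snippet is a verification sketch only; the variable names are illustrative.

```python
import pandas as pd

fruits = ["apple", "apple", "apple", "apple", "banana", "cherry",
          "date", "elderberry", "fig", "grape"]
series = pd.Series(fruits)

# Percentage of total rows per category; multiplying before dividing
# keeps these particular values exact (40.0 and 10.0).
shares = series.value_counts() * 100 / len(series)

# Categories strictly below the 15% threshold are the rare ones.
rare = sorted(shares[shares < 15].index)
print(rare)
```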

**Example 3:** Department grouping
```python
data = {"dept": ["HR", "HR", "HR", "IT", "IT", "Finance", "Legal"]}
group_rare_categories(data, "dept", min_count=2)
```
```
{"dept": ["HR", "HR", "HR", "IT", "IT", "Other", "Other"]}
```
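
For reference after attempting the problem, one possible approach is sketched below. It is not the official solution. The function name `group_rare_categories_sketch` is hypothetical, and two behaviors are assumptions that the problem statement does not specify: when both thresholds are given, a category is treated as rare if it falls below either one, and when neither is given, the data is returned unchanged.

```python
import pandas as pd


def group_rare_categories_sketch(data, column, min_count=None, min_percentage=None):
    """Illustrative sketch; the both-thresholds rule is an assumption."""
    df = pd.DataFrame(data)
    counts = df[column].value_counts()

    # Start with no category marked as rare.
    rare = pd.Series(False, index=counts.index)

    if min_count is not None:
        # Strictly fewer than min_count occurrences.
        rare |= counts < min_count
    if min_percentage is not None:
        # Strictly less than min_percentage percent of all rows.
        rare |= (counts * 100 / len(df)) < min_percentage

    rare_labels = counts[rare].index

    # Keep values that are not rare; replace the rest with "Other".
    df[column] = df[column].where(~df[column].isin(rare_labels), "Other")
    return df.to_dict(orient="list")


print(group_rare_categories_sketch(
    {"dept": ["HR", "HR", "HR", "IT", "IT", "Finance", "Legal"]},
    "dept",
    min_count=2,
))
```

Building one boolean mask over the unique categories and applying it with `isin` avoids looping over rows, and other columns in `data` pass through untouched.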
