Handle Duplicate IDs Practice Problem
This data science coding problem lets you practice duplicate detection and removal: rows that share an ID are resolved by aggregating their values. Read the problem statement, write your solution, and strengthen your understanding of Duplicate Detection & Removal.
- Problem ID: 30
- Problem key: 30-handle-duplicate-ids
- URL: https://datacrack.app/solve/30-handle-duplicate-ids
- Difficulty: medium
- Topic: Duplicate Detection & Removal
- Module: Data Cleaning
Problem Statement
# Handle Duplicate IDs with Different Values
### 🎯 Goal
When rows share the same ID but have different values in other columns, resolve the conflict by aggregating the values.
### 💻 Task
Implement `handle_duplicate_ids(data, id_column, agg_strategy='mean')` that:
1. Converts the input dictionary to a DataFrame
2. Groups rows by the `id_column`
3. Aggregates the remaining columns using the specified strategy
4. Returns the deduplicated DataFrame with reset index
Supported strategies:
- `'mean'` — average of numeric values
- `'sum'` — sum of numeric values
- `'first'` — keep first occurrence's values
- `'last'` — keep last occurrence's values
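A helpful observation (not stated in the problem, but true of pandas): each of these strategy names is also a valid string argument to `GroupBy.agg`, so no manual dispatch table is needed. A quick check:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 1], "value": [10, 20, 30]})

# Every supported strategy name doubles as a pandas aggregation string
for strategy in ("mean", "sum", "first", "last"):
    deduped = df.groupby("id", as_index=False).agg(strategy)
    print(strategy, "->", list(deduped["value"]))
```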
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of data
- `id_column`: The column name that identifies duplicates
- `agg_strategy`: `'mean'`, `'sum'`, `'first'`, or `'last'`
### 📤 Output
- A pandas DataFrame with one row per unique ID, values aggregated by the chosen strategy, with index reset starting from 0
---
### 🧩 Starter Code
```python
import pandas as pd
import numpy as np

def handle_duplicate_ids(data, id_column, agg_strategy='mean'):
    """
    Resolve duplicate IDs by aggregating their
    values using the specified strategy.

    Args:
        data (dict): Input data as dictionary (from JSON)
        id_column (str): Column name that identifies duplicates
        agg_strategy (str): 'mean', 'sum', 'first', or 'last'

    Returns:
        pd.DataFrame: Deduplicated DataFrame with aggregated values
    """
    # TODO: Convert the input dictionary to a DataFrame
    # TODO: Group rows by the id_column
    # TODO: Apply the aggregation strategy (mean/sum/first/last)
    # TODO: Reset the index and return the result
    pass
```
---
### 💡 Examples
**Example 1:** Mean aggregation
```python
data = {"id": [1, 2, 1, 2, 3], "value": [10, 20, 30, 40, 50]}
handle_duplicate_ids(data, id_column="id", agg_strategy="mean")
```
```
id value
0 1 20.0 ← mean(10, 30)
1 2 30.0 ← mean(20, 40)
2 3 50.0
```
**Example 2:** First occurrence
```python
data = {"id": ["A", "B", "A", "B"], "name": ["Alice", "Bob", "Alicia", "Bobby"], "score": [90, 85, 95, 80]}
handle_duplicate_ids(data, id_column="id", agg_strategy="first")
```
```
id name score
0 A Alice 90
1 B Bob 85
```