Handle Duplicate IDs Practice Problem
This data science coding problem lets you practice duplicate detection and removal: rows that share an ID are resolved by aggregating their values. Read the problem statement, write your solution, and strengthen your understanding of Duplicate Detection & Removal.
- Problem ID: 30
- Problem key: 30-handle-duplicate-ids
- URL: https://datacrack.app/solve/30-handle-duplicate-ids
- Difficulty: medium
- Topic: Duplicate Detection & Removal
- Module: Data Cleaning
Problem Statement
# Handle Duplicate IDs with Different Values
### 🎯 Goal
When rows share the same ID but have different values in other columns, resolve the conflict by aggregating the values.
### 💻 Task
Implement `handle_duplicate_ids(data, id_column, agg_strategy='mean')` that:
1. Converts the input dictionary to a DataFrame
2. Groups rows by the `id_column`
3. Aggregates the remaining columns using the specified strategy
4. Returns the deduplicated DataFrame with reset index
Supported strategies:
- `'mean'` — average of numeric values
- `'sum'` — sum of numeric values
- `'first'` — keep first occurrence's values
- `'last'` — keep last occurrence's values
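A helpful observation (not stated in the problem, but true of pandas): each of these strategy names is also a valid string argument to `GroupBy.agg`, so no manual dispatch table is needed. A quick check:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 1], "value": [10, 20, 30]})

# Every supported strategy name doubles as a pandas aggregation string
for strategy in ("mean", "sum", "first", "last"):
    deduped = df.groupby("id", as_index=False).agg(strategy)
    print(strategy, "->", list(deduped["value"]))
```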
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of data
- `id_column`: The column name that identifies duplicates
- `agg_strategy`: `'mean'`, `'sum'`, `'first'`, or `'last'`
### 📤 Output
- A pandas DataFrame with one row per unique ID, values aggregated by the chosen strategy, with index reset starting from 0
---
### 🧩 Starter Code
```python
import pandas as pd
import numpy as np

def handle_duplicate_ids(data, id_column, agg_strategy='mean'):
    """
    Resolve duplicate IDs by aggregating their
    values using the specified strategy.

    Args:
        data (dict): Input data as dictionary (from JSON)
        id_column (str): Column name that identifies duplicates
        agg_strategy (str): 'mean', 'sum', 'first', or 'last'

    Returns:
        pd.DataFrame: Deduplicated DataFrame with aggregated values
    """
    # TODO: Convert the input dictionary to a DataFrame
    # TODO: Group rows by the id_column
    # TODO: Apply the aggregation strategy (mean/sum/first/last)
    # TODO: Reset the index and return the result
    pass
```
---
### 💡 Examples
**Example 1:** Mean aggregation
```python
data = {"id": [1, 2, 1, 2, 3], "value": [10, 20, 30, 40, 50]}
handle_duplicate_ids(data, id_column="id", agg_strategy="mean")
```
```
id value
0 1 20.0 ← mean(10, 30)
1 2 30.0 ← mean(20, 40)
2 3 50.0
```
**Example 2:** First occurrence
```python
data = {"id": ["A", "B", "A", "B"], "name": ["Alice", "Bob", "Alicia", "Bobby"], "score": [90, 85, 95, 80]}
handle_duplicate_ids(data, id_column="id", agg_strategy="first")
```
```
id name score
0 A Alice 90
1 B Bob 85
```