Subset Duplicate Detection Practice Problem

This data science coding problem gives you practice with duplicate detection and removal, specifically flagging rows that repeat on a chosen subset of columns. Read the problem statement, then write your solution.

Problem ID: 32
Problem key: 32-subset-duplicate-detection
URL: https://datacrack.app/solve/32-subset-duplicate-detection
Difficulty: easy
Topic: Duplicate Detection & Removal
Module: Data Cleaning

Problem Statement

# Subset Duplicate Detection

### 🎯 Goal
Identify duplicate rows by comparing only a **subset** of columns rather than the entire row.

### 💻 Task
Implement `find_duplicates_subset(data, subset)` that:
1. Converts the input dictionary to a DataFrame
2. Finds rows that are duplicated based only on the specified subset of columns
3. Returns **all** matching rows (both originals and copies) with all their columns, with reset index
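
The three steps above can be sketched roughly as follows. This is one possible approach, not the reference solution: it assumes pandas' `DataFrame.duplicated` with `keep=False`, and the function name `find_duplicates_subset_sketch` and the sample data are made up for illustration.

```python
import pandas as pd

def find_duplicates_subset_sketch(data, subset):
    # Step 1: build a DataFrame from the column-oriented dictionary.
    df = pd.DataFrame(data)
    # Step 2: keep=False flags every member of a duplicate group,
    # not only the second and later occurrences.
    mask = df.duplicated(subset=subset, keep=False)
    # Step 3: keep all columns of the flagged rows and renumber from 0.
    return df[mask].reset_index(drop=True)

# Illustrative data (hypothetical, not from the problem's test cases).
example = {"name": ["Alice", "Bob", "Alice"], "age": [25, 30, 26]}
result = find_duplicates_subset_sketch(example, ["name"])
```

Here `result` holds both "Alice" rows with their original `age` values, since only `name` is compared.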

---

### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of data
- `subset`: A list of column names to check for duplicates

### 📤 Output
- A pandas DataFrame containing all rows where the `subset` columns have duplicate values (including both occurrences), with index reset starting from 0
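
Two details of this output requirement are easy to miss: "both occurrences" and "index reset". The snippet below is a small illustration, on made-up data, of how pandas' default `keep="first"` differs from `keep=False`, and of what the index looks like before and after `reset_index(drop=True)`.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"name": ["Alice", "Bob", "Alice"], "age": [25, 30, 26]})

# Default keep="first": the first occurrence is NOT flagged.
default_flags = df.duplicated(subset=["name"]).tolist()
# keep=False: every row in a duplicate group is flagged.
full_mask = df.duplicated(subset=["name"], keep=False)
full_flags = full_mask.tolist()

# Filtering keeps the original row labels until the index is reset.
labels_before = df[full_mask].index.tolist()
labels_after = df[full_mask].reset_index(drop=True).index.tolist()

print(default_flags)   # [False, False, True]
print(full_flags)      # [True, False, True]
print(labels_before)   # [0, 2]
print(labels_after)    # [0, 1]
```

Passing `drop=True` matters here: without it, `reset_index` would add the old labels as an extra `index` column.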

---

### 🧩 Starter Code

```python
import pandas as pd
import numpy as np

def find_duplicates_subset(data, subset):
    """
    Find duplicate rows by comparing only a
    subset of columns in the dataset.

    Args:
        data (dict): Input data as dictionary (from JSON)
        subset (list): Column names to check for duplicates

    Returns:
        pd.DataFrame: DataFrame with rows that are duplicated on the subset columns
    """
    # TODO: Convert the input dictionary to a DataFrame
    # TODO: Use duplicated() with the subset parameter to flag duplicates
    # TODO: Return all flagged rows with reset index
    pass
```

---

### 💡 Examples

**Example 1:** Subset on `name` only
```python
data = {
    "name": ["Alice", "Bob", "Alice", "Charlie"],
    "age": [25, 30, 26, 35],
    "city": ["NY", "LA", "NY", "SF"]
}
find_duplicates_subset(data, subset=["name"])
```
```
    name  age city
0  Alice   25   NY    ← Same name despite different ages
1  Alice   26   NY
```

**Example 2:** Subset on multiple columns
```python
data = {"dept": ["HR", "IT", "HR", "IT"], "salary": [50000, 60000, 50000, 70000]}
find_duplicates_subset(data, subset=["dept", "salary"])
```
```
  dept  salary
0   HR   50000
1   HR   50000
```
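
As a way to cross-check the expected output of Example 2, the same rows can be selected by counting group sizes instead of calling `duplicated`. This is only an alternative sketch, assuming the subset columns contain no missing values (by default `groupby` drops rows with missing keys); the helper name `duplicates_via_groupby` is hypothetical.

```python
import pandas as pd

def duplicates_via_groupby(data, subset):
    df = pd.DataFrame(data)
    # Size of the group each row belongs to, aligned with df's index.
    group_sizes = df.groupby(subset)[subset[0]].transform("size")
    # Rows whose subset values occur more than once are duplicates.
    return df[group_sizes > 1].reset_index(drop=True)

data = {"dept": ["HR", "IT", "HR", "IT"], "salary": [50000, 60000, 50000, 70000]}
result = duplicates_via_groupby(data, ["dept", "salary"])
```

Only the two `("HR", 50000)` rows remain, because the two `IT` rows differ in `salary`.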
