Subset Duplicate Detection Practice Problem
This data science coding problem gives you practice detecting duplicate rows on a subset of columns, part of the Duplicate Detection & Removal topic. Read the problem statement, then write your solution.
- Problem ID: 32
- Problem key: 32-subset-duplicate-detection
- URL: https://datacrack.app/solve/32-subset-duplicate-detection
- Difficulty: easy
- Topic: Duplicate Detection & Removal
- Module: Data Cleaning
Problem Statement
# Subset Duplicate Detection
### 🎯 Goal
Identify duplicate rows by comparing only a **subset** of columns rather than the entire row.
### 💻 Task
Implement `find_duplicates_subset(data, subset)` that:
1. Converts the input dictionary to a DataFrame
2. Finds rows that are duplicated based only on the specified subset of columns
3. Returns **all** matching rows (both originals and copies) with all their columns and the index reset to start at 0
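
Step 2 depends on how `DataFrame.duplicated` treats the first occurrence of a repeated value. The sketch below uses a small made-up frame (its data is illustrative, not one of the problem's test cases) to compare the default `keep="first"` with `keep=False`, which appears to be the option that matches the requirement to return originals as well as copies:

```python
import pandas as pd

# Illustrative data only; not taken from the problem's test cases.
df = pd.DataFrame({"name": ["Alice", "Bob", "Alice"], "age": [25, 30, 26]})

# Default keep="first": only later repeats are flagged, the first "Alice" is not.
mask_first = df.duplicated(subset=["name"])

# keep=False: every row whose subset values occur more than once is flagged.
mask_all = df.duplicated(subset=["name"], keep=False)

print(mask_first.tolist())  # [False, False, True]
print(mask_all.tolist())    # [True, False, True]
```

Note that `age` differs between the two "Alice" rows, yet both are flagged because only the columns listed in `subset` are compared.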
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of data
- `subset`: A list of column names to check for duplicates
### 📤 Output
- A pandas DataFrame containing all rows where the `subset` columns have duplicate values (including both occurrences), with index reset starting from 0
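
Filtering a DataFrame with a boolean mask keeps the original row labels, so the result does not automatically start at 0. The following sketch assumes `reset_index(drop=True)` is the intended way to renumber the rows (the problem statement does not name a method), and uses made-up data:

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"name": ["Bob", "Alice", "Charlie", "Alice"]})

dupes = df[df.duplicated(subset=["name"], keep=False)]
print(dupes.index.tolist())  # [1, 3]: original labels survive filtering

# drop=True discards the old labels instead of adding them as an "index" column.
renumbered = dupes.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```

Without `drop=True`, the old labels would be inserted as an extra column named `index`, which would not match the expected output shape.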
---
### 🧩 Starter Code
```python
import pandas as pd
import numpy as np


def find_duplicates_subset(data, subset):
    """
    Find duplicate rows by comparing only a
    subset of columns in the dataset.

    Args:
        data (dict): Input data as dictionary (from JSON)
        subset (list): Column names to check for duplicates

    Returns:
        pd.DataFrame: DataFrame with rows that are duplicated on the subset columns
    """
    # TODO: Convert the input dictionary to a DataFrame
    # TODO: Use duplicated() with the subset parameter to flag duplicates
    # TODO: Return all flagged rows with reset index
    pass
```
---
### 💡 Examples
**Example 1:** Subset on `name` only
```python
data = {
    "name": ["Alice", "Bob", "Alice", "Charlie"],
    "age": [25, 30, 26, 35],
    "city": ["NY", "LA", "NY", "SF"]
}
find_duplicates_subset(data, subset=["name"])
```
```
    name  age city
0  Alice   25   NY   ← Same name despite different ages
1  Alice   26   NY
```
**Example 2:** Subset on multiple columns
```python
data = {"dept": ["HR", "IT", "HR", "IT"], "salary": [50000, 60000, 50000, 70000]}
find_duplicates_subset(data, subset=["dept", "salary"])
```
```
  dept  salary
0   HR   50000
1   HR   50000
```
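
One possible way to fill in the TODOs is sketched below and checked against the two examples above. It is a minimal sketch rather than an official solution, and it assumes that `keep=False` and `reset_index(drop=True)` are the intended options; the variable names `example_1` and `example_2` are introduced here for illustration.

```python
import pandas as pd


def find_duplicates_subset(data, subset):
    """Return every row whose `subset` values appear more than once."""
    df = pd.DataFrame(data)
    # keep=False flags the first occurrence as well as the later repeats.
    mask = df.duplicated(subset=subset, keep=False)
    # drop=True renumbers from 0 without keeping the old labels as a column.
    return df[mask].reset_index(drop=True)


# Example 1: duplicates on "name" only.
example_1 = find_duplicates_subset(
    {
        "name": ["Alice", "Bob", "Alice", "Charlie"],
        "age": [25, 30, 26, 35],
        "city": ["NY", "LA", "NY", "SF"],
    },
    subset=["name"],
)
print(example_1)

# Example 2: duplicates on the combination of "dept" and "salary".
example_2 = find_duplicates_subset(
    {"dept": ["HR", "IT", "HR", "IT"], "salary": [50000, 60000, 50000, 70000]},
    subset=["dept", "salary"],
)
print(example_2)
```

In Example 2 the two `IT` rows are not returned: they share a `dept` but differ in `salary`, and a row counts as a duplicate only when all columns in `subset` match.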