Remove Duplicates Practice Problem
This data science coding problem gives you practice with Duplicate Detection & Removal: removing duplicate rows from a pandas DataFrame and implementing the logic yourself. Read the problem statement, write your solution, and strengthen your understanding of the topic.
- Problem ID: 31
- Problem key: 31-remove-duplicates
- URL: https://datacrack.app/solve/31-remove-duplicates
- Difficulty: easy
- Topic: Duplicate Detection & Removal
- Module: Data Cleaning
Problem Statement
# Remove Duplicate Rows
### 🎯 Goal
Remove duplicate rows from a dataset while controlling which occurrence to keep.
### 💻 Task
Implement `remove_duplicates(data, keep='first')` that:
1. Converts the input dictionary to a DataFrame
2. Removes duplicate rows based on the `keep` parameter
3. Returns the cleaned DataFrame with reset index
The `keep` parameter controls behavior:
- `'first'` — keep the first occurrence, drop later ones
- `'last'` — keep the last occurrence, drop earlier ones
- `False` — drop **all** occurrences of duplicates
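To make the three `keep` options concrete, the sketch below uses `DataFrame.duplicated`, which marks the rows that `drop_duplicates` would remove for the same `keep` value. The data mirrors the examples in this problem; the `masks` name is only illustrative and not part of the required solution.

```python
import pandas as pd

# Rows 0 and 2 are identical; rows 1 and 3 are unique.
df = pd.DataFrame({"A": [1, 2, 1, 3], "B": ["x", "y", "x", "z"]})

# duplicated() returns True for each row that would be dropped.
masks = {
    keep: df.duplicated(keep=keep).tolist()
    for keep in ("first", "last", False)
}

for keep, mask in masks.items():
    print(f"keep={keep!r}: {mask}")
# keep='first': [False, False, True, False]
# keep='last': [True, False, False, False]
# keep=False: [True, False, True, False]
```

With `'first'` only the later copy is flagged, with `'last'` only the earlier copy, and with `False` both copies are flagged.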
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of data
- `keep`: `'first'`, `'last'`, or `False`
### 📤 Output
- A pandas DataFrame with duplicates removed and index reset starting from 0
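The index reset matters because dropping rows leaves gaps in the original index labels. A minimal sketch of the effect, using illustrative variable names (`deduped`, `cleaned`) that are not part of the problem:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 1, 3], "B": ["x", "y", "x", "z"]})

# Dropping row 0 leaves the surviving rows with their old labels.
deduped = df.drop_duplicates(keep="last")
print(deduped.index.tolist())  # [1, 2, 3]

# drop=True discards the old labels instead of adding them as a column.
cleaned = deduped.reset_index(drop=True)
print(cleaned.index.tolist())  # [0, 1, 2]
```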
---
### 🧩 Starter Code
```python
import pandas as pd
import numpy as np


def remove_duplicates(data, keep='first'):
    """
    Remove duplicate rows from a dataset while
    controlling which occurrence to keep.

    Args:
        data (dict): Input data as dictionary (from JSON)
        keep (str or bool): 'first', 'last', or False

    Returns:
        pd.DataFrame: DataFrame with duplicates removed
    """
    # TODO: Convert the input dictionary to a DataFrame
    # TODO: Use drop_duplicates() with the keep parameter
    # TODO: Reset the index and return the result
    pass
```
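For reference, one possible way to fill in the three TODOs is sketched below. It is a minimal version that assumes `data` is a well-formed dictionary of equal-length lists and performs no validation of `keep`; other approaches are equally valid.

```python
import pandas as pd


def remove_duplicates(data, keep="first"):
    """Possible completion of the starter code (a sketch, not the official solution)."""
    # Step 1: build a DataFrame from the column -> values dictionary.
    df = pd.DataFrame(data)
    # Step 2: drop fully identical rows; keep is 'first', 'last', or False.
    deduped = df.drop_duplicates(keep=keep)
    # Step 3: renumber the rows from 0 and discard the old index labels.
    return deduped.reset_index(drop=True)


data = {"A": [1, 2, 1, 3], "B": ["x", "y", "x", "z"]}
result = remove_duplicates(data, keep=False)
print(result)
```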
---
### 💡 Examples
**Example 1:** Keep first
```python
data = {"A": [1, 2, 1, 3], "B": ["x", "y", "x", "z"]}
remove_duplicates(data, keep='first')
```
```
   A  B
0  1  x
1  2  y
2  3  z
```
**Example 2:** Keep last
```python
remove_duplicates(data, keep='last')
```
```
   A  B
0  2  y
1  1  x
2  3  z
```
**Example 3:** Drop all duplicates
```python
remove_duplicates(data, keep=False)
```
```
   A  B
0  2  y
1  3  z
```