Fuzzy Duplicate Detection Practice Problem
This data science coding problem helps you practice Duplicate Detection & Removal, fuzzy duplicate detection, and implementation skills. Read the problem statement, write your solution, and strengthen your understanding of Duplicate Detection & Removal.
- Problem ID: 29
- Problem key: 29-fuzzy-duplicate-detection
- URL: https://datacrack.app/solve/29-fuzzy-duplicate-detection
- Difficulty: hard
- Topic: Duplicate Detection & Removal
- Module: Data Cleaning
Problem Statement
# Fuzzy Duplicate Detection using Edit Distance
### 🎯 Goal
Find **near-duplicate** entries in a column by measuring how similar two strings are using Levenshtein distance.
### 💻 Task
Implement `fuzzy_duplicate_detection(data, column, threshold)` that:
1. Converts the input dictionary to a DataFrame
2. Computes the Levenshtein distance between every pair of values in the specified column
3. Returns all pairs where the distance is **≤ threshold**
You must implement the Levenshtein distance yourself (no external libraries).
### 🔣 Levenshtein Distance
The minimum number of single-character edits (insert, delete, substitute) to transform one string into another.
`d("kitten", "sitting") = 3`
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of data
- `column`: The column name to compare values in
- `threshold`: Maximum edit distance to consider a \"fuzzy match\"
### 📤 Output
- A list of dictionaries, each with keys: `row_i`, `row_j`, `value_i`, `value_j`, `distance`
- Only include pairs where `distance ≤ threshold`
- Pairs are ordered by `(row_i, row_j)` with `row_i < row_j`
---
### 🧩 Starter Code
```python
import pandas as pd
import numpy as np
def fuzzy_duplicate_detection(data, column, threshold):
"""
Find near-duplicate entries in a column using
Levenshtein (edit) distance.
Args:
data (dict): Input data as dictionary (from JSON)
column (str): Column name to compare values in
threshold (int): Maximum edit distance for a fuzzy match
Returns:
list[dict]: List of dicts with row_i, row_j, value_i, value_j, distance
"""
# TODO: Implement a helper function for Levenshtein distance
# TODO: Convert the input dictionary to a DataFrame
# TODO: Compare every pair of values in the specified column
# TODO: Return pairs where distance <= threshold
pass
```
---
### 💡 Examples
**Example 1:**
```python
data = {"name": ["John", "Jon", "Jane", "Janet"]}
fuzzy_duplicate_detection(data, column="name", threshold=1)
```
```python
[
{"row_i": 0, "row_j": 1, "value_i": "John", "value_j": "Jon", "distance": 1},
{"row_i": 2, "row_j": 3, "value_i": "Jane", "value_j": "Janet", "distance": 1}
]
```
**Example 2:**
```python
data = {"city": ["New York", "new york", "Los Angeles", "San Francisco"]}
fuzzy_duplicate_detection(data, column="city", threshold=2)
```
```python
[
{"row_i": 0, "row_j": 1, "value_i": "New York", "value_j": "new york", "distance": 2}
]
```